**1. Import Necessary Libraries**

In [1]:
import pandas as pd

**2. Read the Data into Python**

In [2]:
college = pd.read_csv('cleaned_college_data.csv')
print(college.head())

                        College Private  Apps  Accept  Enroll  Top10perc  \
0  Abilene Christian University     yes  1660    1232     721         23   
1            Adelphi University     yes  2186    1924     512         16   
2                Adrian College     yes  1428    1097     336         22   
3           Agnes Scott College     yes   417     349     137         60   
4     Alaska Pacific University     yes   193     146      55         16   

   Top25perc  Fundergrad  Pundergrad  Outstate  RoomBoard  Books  Personal  \
0         52        2885         537      7440       3300    450      2200   
1         29        2683        1227     12280       6450    750      1500   
2         50        1036          99     11250       3750    400      1165   
3         89         510          63     12960       5450    450       875   
4         44         249         869      7560       4120    800      1500   

   PhD  Terminal  SFRatio  percalumni  Expend  GradRate  
0   70        78

**3. Produce a Numerical Summary of the Variables in the Data Set**

In [3]:
college.describe()

|  | Apps | Accept | Enroll | Top10perc | Top25perc | Fundergrad | Pundergrad | Outstate | RoomBoard | Books | Personal | PhD | Terminal | SFRatio | percalumni | Expend | GradRate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 | 776.0 |
| mean | 3000.548969 | 2016.981959 | 780.298969 | 27.582474 | 55.823454 | 3703.373711 | 856.385309 | 10442.030928 | 4356.904639 | 549.315722 | 1341.725515 | 72.725515 | 79.744845 | 14.089433 | 22.747423 | 9662.701031 | 65.395619 |
| std | 3872.578376 | 2452.16802 | 929.731001 | 17.639133 | 19.803448 | 4852.585693 | 1523.112196 | 4025.431964 | 1097.266697 | 165.201826 | 676.833991 | 16.236897 | 14.684883 | 3.960895 | 12.399401 | 5224.659733 | 17.084743 |
| min | 81.0 | 72.0 | 35.0 | 1.0 | 9.0 | 139.0 | 1.0 | 2340.0 | 1780.0 | 96.0 | 250.0 | 8.0 | 24.0 | 2.5 | 0.0 | 3186.0 | 10.0 |
| 25% | 776.0 | 603.25 | 242.0 | 15.0 | 41.0 | 991.0 | 95.0 | 7305.0 | 3595.75 | 469.5 | 865.0 | 62.0 | 71.0 | 11.5 | 13.0 | 6749.25 | 53.0 |
| 50% | 1557.5 | 1109.5 | 434.0 | 23.0 | 54.0 | 1707.5 | 354.0 | 9990.0 | 4197.5 | 500.0 | 1200.0 | 75.0 | 82.0 | 13.6 | 21.0 | 8392.5 | 65.0 |
| 75% | 3603.0 | 2407.5 | 902.25 | 35.0 | 69.0 | 4030.25 | 967.25 | 12931.25 | 5050.0 | 600.0 | 1700.0 | 85.0 | 92.0 | 16.5 | 31.0 | 10838.5 | 78.0 |
| max | 48094.0 | 26330.0 | 6392.0 | 96.0 | 100.0 | 31643.0 | 21836.0 | 21700.0 | 8124.0 | 2340.0 | 6800.0 | 103.0 | 100.0 | 39.8 | 64.0 | 56233.0 | 100.0 |


* All numerical attributes have the same number of entries (777 in the raw data), indicating **no missing values** for any of the numerical features. The summary shown above reports 776, apparently because it was generated after the erroneous row handled in step 4 had been removed.
* For the application attribute `Apps`:
    - min = 81, mean = 3001, max = 48094, std = 3873
    - the maximum is about 16 times the mean and the standard deviation exceeds the mean, so `Apps` very likely contains **outliers**
* The attributes `Accept`, `Enroll`, `Fundergrad`, `Pundergrad`, `Outstate`, and `Expend` are also suspected of containing outliers.
* The attribute `GradRate` has a maximum value of 118 in the raw data, which is an obvious data error because a graduation rate cannot exceed 100%.
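
The two checks noted above (missing values and impossible percentages) can also be made explicit in code rather than read off `describe()`. The sketch below runs on a small made-up DataFrame, not on the college data; the list `percent_columns` and the helper `find_out_of_range` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the college data set.
toy = pd.DataFrame({
    "College": ["A", "B", "C", "D"],
    "Apps": [500, 1200, None, 48000],
    "GradRate": [65, 118, 80, 100],
})

# Columns that hold percentages and so must lie in [0, 100] (assumed list).
percent_columns = ["GradRate"]


def find_out_of_range(df, columns, low=0, high=100):
    """Return the rows where any of the given columns falls outside [low, high]."""
    mask = ((df[columns] < low) | (df[columns] > high)).any(axis=1)
    return df[mask]


missing_per_column = toy.isna().sum()   # number of missing values in each column
invalid_rows = find_out_of_range(toy, percent_columns)

print(missing_per_column)
print(invalid_rows)
```

On the real data, `percent_columns` would presumably list every percentage-valued attribute, so that one call covers them all.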

**4. Handle Data Errors**

In [8]:
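# Drop rows whose graduation rate exceeds 100%, which is impossible.
index_to_drop = college[college['GradRate'] > 100].index
college = college.drop(index_to_drop)

# The shape is printed after the drop, so label it accordingly.
print(f"DataFrame shape after dropping invalid rows: {college.shape}")

DataFrame shape after dropping invalid rows: (776, 19)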


In [4]:
outlier_suspected_columns = ['Apps', 'Accept', 'Enroll', 'Fundergrad', 'Pundergrad', 'Outstate', 'Expend']

**5. Coefficient of Variation**

In [5]:
print("--- Coefficient of Variation (CV) ---")
for col in outlier_suspected_columns:
    mean = college[col].mean()
    std = college[col].std()
    if mean != 0:
        cv = (std / mean) * 100
        print(f"CV for {col}: {cv:.2f}%")
    else:
        print(f"CV for {col}: Mean is zero, cannot calculate CV.")

--- Coefficient of Variation (CV) ---
CV for Apps: 129.06%
CV for Accept: 121.58%
CV for Enroll: 119.15%
CV for Fundergrad: 131.03%
CV for Pundergrad: 177.85%
CV for Outstate: 38.55%
CV for Expend: 54.07%
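
The coefficient of variation used here is the sample standard deviation divided by the mean, expressed as a percentage. As a check against the summary table, `Apps` gives 3872.578 / 3000.549 * 100 ≈ 129.06%. The same calculation can be written without a loop; the following is a minimal sketch on made-up numbers, with column names chosen only for illustration.

```python
import pandas as pd

# Toy data for illustration only (not the college data).
toy = pd.DataFrame({
    "tight": [98, 100, 102],    # values close to the mean -> small CV
    "spread": [10, 100, 190],   # same mean, much larger spread -> large CV
})

# DataFrame.std() uses the sample standard deviation (ddof=1) by default.
cv_percent = toy.std() / toy.mean() * 100

print(cv_percent.round(2))
```

Both toy columns have a mean of 100, so the CV (2% versus 90%) reflects only the difference in spread. Unlike the loop above, this vectorized form does not guard against a zero mean, which would yield `inf` or `NaN`.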


**6. Median and Mode**

In [6]:
print("--- Median and Mode ---")
for col in outlier_suspected_columns:
    median = college[col].median()
    mode = college[col].mode()
    mean = college[col].mean()
    print(f"{col}:")
    print(f"  Mean: {mean:.2f}")
    print(f"  Median: {median:.2f}")
    if not mode.empty:
        print(f"  Mode: {', '.join(mode.astype(str).tolist())}")
    else:
        print("  Mode: No unique mode found (or multiple modes)")
print("\n")

--- Median and Mode ---
Apps:
  Mean: 3000.55
  Median: 1557.50
  Mode: 440, 663, 1006
Accept:
  Mean: 2016.98
  Median: 1109.50
  Mode: 452
Enroll:
  Mean: 780.30
  Median: 434.00
  Mode: 177, 295
Fundergrad:
  Mean: 3703.37
  Median: 1707.50
  Mode: 500, 662, 959, 1115, 1306, 1345, 1707
Pundergrad:
  Mean: 856.39
  Median: 354.00
  Mode: 30
Outstate:
  Mean: 10442.03
  Median: 9990.00
  Mode: 6550
Expend:
  Mean: 9662.70
  Median: 8392.50
  Mode: 4900, 5935, 6333, 6413, 6433, 6562, 6716, 6719, 6898, 6971, 7041, 7114, 7309, 7348, 7762, 7881, 7940, 8118, 8135, 8189, 8324, 8355, 8604, 8686, 8847, 8954, 9084, 9158, 9209, 9431, 10872, 10912, 10922
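
When `mode()` returns a long list of tied values, as it does for `Expend`, it helps to check how often a modal value actually occurs. The sketch below, on made-up numbers, uses `value_counts()` to report that frequency. If the modal count turned out to be only 2 or 3 out of several hundred rows, the mode would say little about the centre of the distribution.

```python
import pandas as pd

# Toy data for illustration only (not the college data).
values = pd.Series([4900, 4900, 5935, 5935, 6333, 7000, 8100, 9400])

modes = values.mode()              # every value tied for the highest frequency
counts = values.value_counts()     # frequency of each distinct value
modal_count = counts.max()         # how many times a modal value occurs
modal_share = modal_count / len(values)

print(f"Modes: {modes.tolist()}")
print(f"Each mode occurs {modal_count} times ({modal_share:.0%} of rows)")
```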




**7. Skewness Coefficient**

In [7]:
print("--- Skewness Coefficient ---")
for col in outlier_suspected_columns:
    skewness = college[col].skew()
    print(f"Skewness for {col}: {skewness:.2f}")
    # Rule of thumb: |skew| > 1 is high, 0.5 to 1 is moderate, otherwise roughly symmetric.
    direction = "right" if skewness > 0 else "left"
    if abs(skewness) > 1:
        print(f"  - {col} is highly {direction}-skewed.")
    elif abs(skewness) > 0.5:
        print(f"  - {col} is moderately {direction}-skewed.")
    else:
        print(f"  - {col} is fairly symmetrical.")

--- Skewness Coefficient ---
Skewness for Apps: 3.72
  - Apps is highly right-skewed.
Skewness for Accept: 3.42
  - Accept is highly right-skewed.
Skewness for Enroll: 2.69
  - Enroll is highly right-skewed.
Skewness for Fundergrad: 2.61
  - Fundergrad is highly right-skewed.
Skewness for Pundergrad: 5.69
  - Pundergrad is highly right-skewed.
Skewness for Outstate: 0.51
  - Outstate is moderately right-skewed.
Skewness for Expend: 3.46
  - Expend is highly right-skewed.
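
For reference, a sample skewness coefficient can be computed by hand. The sketch below implements the adjusted Fisher-Pearson formula, G1 = n / ((n - 1)(n - 2)) * Σ((x_i - mean) / s)^3, using only the standard library. That this is the estimator behind `Series.skew()` is an assumption here and is worth confirming in the pandas documentation; the function name `sample_skewness` and the data are made up for illustration.

```python
import statistics


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Toy data for illustration only (not the college data).
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 2, 3, 4, 50]

print(f"{sample_skewness(symmetric):.2f}")
print(f"{sample_skewness(long_right_tail):.2f}")
```

The symmetric list gives a coefficient of essentially zero, while the single large value in the second list pushes the coefficient well above 1. A few very large institutions have the same kind of effect on attributes such as `Apps`.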


**Conclusion of Steps 5, 6, and 7: Presence of Potential Outliers**

Since,
* the coefficient of variation is high for every attribute (above 100% for five of the seven; 38.55% for `Outstate` and 54.07% for `Expend`),
* the comparison of mean, median, and mode gives:
    - For `Apps`, Mean > Median > Mode
    - For `Accept`, Mean > Median > Mode
    - For `Enroll`, Mean > Median > Mode
    - For `Fundergrad`, Mean > Median > Mode
    - For `Pundergrad`, Mean > Median > Mode
    - For `Outstate`, Mean > Median > Mode
* the skewness coefficient of every attribute is above 0.5 (`Outstate` only marginally, at 0.51; the others range from 2.61 to 5.69),

we conclude that all these attributes are **right-skewed** and contain **potential outliers**, meaning that a few large institutions account for disproportionately high values (for example, application counts).

However, for `Expend`, many values are tied for the highest frequency, which suggests that the mode is not an informative measure of central tendency for this continuous-like attribute. Although its mean exceeds its median, we will confirm its skewness through visualizations (histograms and box plots).
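
Box plots conventionally mark points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles, so the same rule can be applied numerically before plotting. The following is a minimal sketch on made-up data; `iqr_outliers` is a hypothetical helper, and the multiplier `k=1.5` is the usual convention rather than something specified in this notebook.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3 (Tukey's rule)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy data for illustration only (not the college data).
apps = pd.Series([400, 650, 900, 1500, 1600, 2400, 3600, 48000])

flagged = iqr_outliers(apps)
print(flagged)
```

Applied to each column in `outlier_suspected_columns`, a helper like this would give a count of flagged rows per attribute to compare with what the box plots show.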

In [9]:
college.to_csv('error_free_college_data.csv', index=False)

----
**FURTHER EXPLORATIONS WILL BE IN EXPLORATORY DATA ANALYSIS (EDA)**

-----