# Step 3: Exploratory Data Analysis

## 1. Introduction

   ***Project Goal:*** Develop a model that accurately predicts the number of awards per 100 full-time undergraduates.

   ***Target Variable:*** awards_per_value: The number of awards per 100 full-time undergraduates.

   ***Features:***

   - **chronname:** The name of the college or university.
   - **city:** The city in which the college is located.
   - **state:** The state in which the college is located.
   - **control:** The type of college (public or private).
   - **basic:** The Carnegie Basic Classification of the college.
   - **hbcu:** A flag indicating whether the college is a historically black college or university.
   - **flagship:** A flag indicating whether the college is a flagship institution.
   - **long_x:** The longitude of the college.
   - **lat_y:** The latitude of the college.
   - **site:** The website of the college.
   - **student_count:** The number of students enrolled at the college.
   - **awards_per_state_value:** The state average of awards per 100 full-time undergraduates, for comparison with awards_per_value.
   - **awards_per_natl_value:** The national average of awards per 100 full-time undergraduates, for comparison with awards_per_value.
   - **exp_award_value:** The amount of money spent per award.
   - **exp_award_state_value:** The state average of the amount of money spent per award.
   - **exp_award_natl_value:** The national average of the amount of money spent per award.
   - **exp_award_percentile:** The percentile of the amount of money spent per award compared to other colleges.
   - **ft_pct:** The percentage of full-time students.
   - **fte_value:** The number of full-time equivalent students.
   - **fte_percentile:** The percentile of the number of full-time equivalent students compared to other colleges.
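   - **med_sat_value:** The median SAT score among first-time students.
   - **med_sat_percentile:** The percentile rank of the median SAT score compared to other colleges.
   - **aid_value:** The average amount of financial aid.
   - **aid_percentile:** The percentile rank of the average financial aid amount compared to other colleges.
   - **endow_value:** The value of the college's endowment.
   - **endow_percentile:** The percentile rank of the endowment value compared to other colleges.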
   - **grad_100_value:** The percentage of students who graduated within 100% of normal time.
   - **grad_100_percentile:** The percentile rank of the 100% graduation rate compared to other colleges.
   - **grad_150_value:** The percentage of students who graduated within 150% of normal time.
   - **grad_150_percentile:** The percentile rank of the 150% graduation rate compared to other colleges.
   - **pell_value:** The percentage of undergraduates receiving a Pell Grant.
   - **pell_percentile:** The percentile rank of the Pell Grant share compared to other colleges.
   - **retain_value:** The percentage of first-year students retained at the same institution.
   - **retain_percentile:** The percentile rank of the retention rate compared to other colleges.
   - **similar:** The identifiers of peer colleges in the dataset that are considered similar to this college.
   - **state_sector_ct:** The number of colleges in the same state and sector.
   - **carnegie_ct:** The number of colleges in the same Carnegie classification.
   - **counted_pct:** The percentage of students who are counted in the dataset.
   - **cohort_size:** The size of the cohort.
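
Several features come in `_value` / `_percentile` pairs. As a rough illustration of how a percentile rank relates to the raw value, the sketch below ranks a toy `aid_value` column with pandas. The toy numbers, the `percentile_rank` helper, and the `aid_percentile_sketch` column are hypothetical. The dataset's own percentiles may be computed within a peer group rather than over all rows, so this is not a reconstruction of the published columns.

```python
import pandas as pd


def percentile_rank(values: pd.Series) -> pd.Series:
    """Return each value's percentile rank (0-100) among the non-missing values."""
    # rank(pct=True) divides each rank by the number of non-missing values;
    # missing values stay missing.
    return (values.rank(pct=True) * 100).round(1)


# Hypothetical toy data, not taken from collegedata.csv.
toy = pd.DataFrame({"aid_value": [2000.0, 8000.0, 5000.0, None, 11000.0]})
toy["aid_percentile_sketch"] = percentile_rank(toy["aid_value"])
print(toy)
```

A percentile rank of this kind is bounded between 0 and 100 and only reflects ordering, which is why it is not interchangeable with a percentage of students or dollars.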

## 2. Imported Packages and Libraries

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

%matplotlib inline
```

## 3. Load Data

```python
data = pd.read_csv('collegedata.csv')
```

```python
data.shape
```

```
(3798, 44)
```
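
The frame has 44 columns, which is more than the feature list in the introduction describes. One way to find the columns that still need a description is to compare the frame's column names against the documented ones. The sketch below uses a hypothetical, abbreviated `documented` set and an empty toy frame. In the notebook itself, `toy` would be replaced by `data` and `documented` by the full list of names from the introduction.

```python
import pandas as pd

# Hypothetical, abbreviated stand-ins for the documented names and the real frame.
documented = {"chronname", "city", "state", "control", "awards_per_value"}
toy = pd.DataFrame(
    columns=["unitid", "chronname", "city", "state", "level", "control", "awards_per_value"]
)

# Columns present in the frame but absent from the documented set, in frame order.
undocumented = [col for col in toy.columns if col not in documented]
print(undocumented)
```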

```python
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3798 entries, 0 to 3797
Data columns (total 44 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   unitid                  3798 non-null   int64  
 1   chronname               3798 non-null   object 
 2   city                    3798 non-null   object 
 3   state                   3798 non-null   object 
 4   level                   3798 non-null   object 
 5   control                 3798 non-null   object 
 6   basic                   3798 non-null   object 
 7   hbcu                    3798 non-null   object 
 8   flagship                3798 non-null   object 
 9   long_x                  3798 non-null   float64
 10  lat_y                   3798 non-null   float64
 11  site                    3798 non-null   object 
 12  student_count           3798 non-null   int64  
 13  awards_per_value        3798 non-null   float64
 14  awards_per_state_value  3798 non-null   
```
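
The `Dtype` column above separates text columns (`object`) from numeric ones (`int64`, `float64`). A minimal sketch of how that split could be made programmatically, along with a per-column count of missing values, is shown below. The toy frame and the names `categorical_cols`, `numeric_cols`, and `missing_counts` are assumptions for illustration. In the notebook, the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few columns of the dataset.
toy = pd.DataFrame({
    "chronname": ["College A", "College B", "College C"],
    "control": ["Public", "Private", "Public"],
    "student_count": [1200, 560, 8400],
    "awards_per_value": [18.5, np.nan, 22.1],
})

# Numeric columns are selected by dtype; everything else is treated as categorical.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of missing values in each column.
missing_counts = toy.isna().sum()

print(categorical_cols)
print(numeric_cols)
print(missing_counts)
```

Treating every non-numeric column as categorical is a simplification: free-text columns such as names or website addresses usually need separate handling.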

## 4. Categorical Features