The "IIT Admissions Dataset" provides information about 200,000 students who have applied for admission to the Indian Institutes of Technology (IITs). It includes details such as the field of study, specialization, fees, and discounts offered to the students.

We'll try to answer the following questions:

How many students were admitted each year?
What is the distribution of students across the different fields of study?
How many students are enrolled in each specialization?
What is the average age of students in each field of study?
What is the current semester distribution among all students?
Are there any correlations between the field of study and the fees paid?
Are there any correlations between the fees and the discount offered on the fees?

In [1]:
import pandas as pd 
import numpy as np


In [2]:
df = pd.read_csv("../input/student_data.csv")

In [3]:
#Replicate the rows as many whole times as fit within the intended in-memory size
intended_df_size_in_MB = 256
factor = intended_df_size_in_MB*(2**20)//df.memory_usage(index=True).sum()
if factor > 0:
    df = pd.concat([df]*factor, ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3200000 entries, 0 to 3199999
Data columns (total 10 columns):
 #   Column                       Dtype 
---  ------                       ----- 
 0   Student ID                   int64 
 1   Student Name                 object
 2   Date of Birth                object
 3   Field of Study               object
 4   Year of Admission            int64 
 5   Expected Year of Graduation  int64 
 6   Current Semester             int64 
 7   Specialization               object
 8   Fees                         int64 
 9   Discount on Fees             int64 
dtypes: int64(6), object(4)
memory usage: 244.1+ MB


In [3]:
df.head()

,Student ID,Student Name,Date of Birth,Field of Study,Year of Admission,Expected Year of Graduation,Current Semester,Specialization,Fees,Discount on Fees
0,165527,Bryan Rogers,2006-01-19,Computer Science,2020,2017,3,Web Development,155152,19572
1,635763,James Hogan,1999-05-23,Mechanical Engineering,2020,2020,2,Machine Learning,157870,14760
2,740021,David Robinson,1997-12-02,Civil Engineering,2017,2022,1,Network Security,55662,5871
3,433076,Susan Miller,1999-10-30,Computer Science,2021,2019,1,Data Science,134955,17284
4,441628,Brittany Martin,1998-01-10,Chemical Engineering,2016,2018,1,Network Security,125934,14871


In [4]:
df.columns =['student_id','student_name', 'DOB', 'field_of_study', 'year_of_admission', 'expected_graduation', 'current_sem', 'specialization', 'fees', 'discount']
df.describe()

,student_id,year_of_admission,expected_graduation,current_sem,fees,discount
count,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0
mean,549367.492925,2018.997685,2019.995235,2.49902,125092.847595,12484.258575
std,259361.565011,2.002381,1.997744,1.117804,43287.894903,8788.362629
min,100001.0,2016.0,2017.0,1.0,50000.0,0.0
25%,325311.0,2017.0,2018.0,1.0,87641.5,5383.0
50%,548855.5,2019.0,2020.0,2.0,125221.0,10792.5
75%,774182.5,2021.0,2022.0,3.0,162597.25,18154.0
max,999997.0,2022.0,2023.0,4.0,200000.0,39865.0


In [5]:
#Check for missing values 
df.isna().any()

student_id             False
student_name           False
DOB                    False
field_of_study         False
year_of_admission      False
expected_graduation    False
current_sem            False
specialization         False
fees                   False
discount               False
dtype: bool

In [6]:
#Check for duplicated rows
df.duplicated().value_counts()

False    200000
dtype: int64
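
`df.duplicated()` only flags rows that match in every column. As a possible complementary check, duplicates can also be counted on the identifier column alone. The sketch below uses a small hypothetical frame (`toy` is not part of the dataset) to show the difference between the two checks.

```python
import pandas as pd

# Hypothetical toy frame: two different students share the same student_id
toy = pd.DataFrame({
    "student_id": [1, 2, 2],
    "student_name": ["A", "B", "C"],
})

# Rows identical in every column (none here)
full_row_dupes = toy.duplicated().sum()

# Rows whose student_id repeats an earlier row's student_id
id_dupes = toy.duplicated(subset="student_id").sum()

print(full_row_dupes, id_dupes)
```

On the real data, the same call would be `df.duplicated(subset='student_id').sum()`, assuming the columns have been renamed as in cell 4.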

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   student_id           200000 non-null  int64 
 1   student_name         200000 non-null  object
 2   DOB                  200000 non-null  object
 3   field_of_study       200000 non-null  object
 4   year_of_admission    200000 non-null  int64 
 5   expected_graduation  200000 non-null  int64 
 6   current_sem          200000 non-null  int64 
 7   specialization       200000 non-null  object
 8   fees                 200000 non-null  int64 
 9   discount             200000 non-null  int64 
dtypes: int64(6), object(4)
memory usage: 15.3+ MB


Creating a cross-tabulation of 'field_of_study' and 'specialization':

In [8]:
cross_tab = pd.crosstab(df['field_of_study'], df['specialization'])
cross_tab

field_of_study,Artificial Intelligence,Data Science,Machine Learning,Network Security,Web Development
Chemical Engineering,7945,7955,7924,8040,8156
Civil Engineering,7864,7925,7880,8076,8029
Computer Science,7900,8018,8131,7887,8024
Electrical Engineering,8058,8032,8201,8028,7986
Mechanical Engineering,7997,7940,8006,7901,8097
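
Raw counts can be hard to compare when row totals differ. One option is to ask `pd.crosstab` for row-wise shares through its `normalize` parameter. This is a minimal sketch on hypothetical toy data (`toy` and `shares` are illustrative names), not the notebook's own output.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "field_of_study": ["Computer Science", "Computer Science",
                       "Civil Engineering", "Civil Engineering"],
    "specialization": ["Data Science", "Web Development",
                       "Data Science", "Data Science"],
})

# normalize="index" divides each cell by its row total,
# so every row sums to 1
shares = pd.crosstab(toy["field_of_study"], toy["specialization"],
                     normalize="index")
print(shares)
```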


In [9]:
#Distribution of students across different 'field_of_study'
counts = df['field_of_study'].value_counts()
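
The counts can also be expressed as fractions of the total, which makes the distribution easier to read at a glance. A minimal sketch, assuming a hypothetical toy Series in place of `df['field_of_study']`:

```python
import pandas as pd

# Hypothetical toy values standing in for df['field_of_study']
fields = pd.Series(
    ["Computer Science", "Civil Engineering",
     "Computer Science", "Chemical Engineering"],
    name="field_of_study",
)

# normalize=True returns each category's share of all rows
field_share = fields.value_counts(normalize=True)
print(field_share)
```

The same call works for any categorical column, for example the current semester.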


In [10]:
#Students enrolled in each specialization
df['specialization'].value_counts()

Web Development            40292
Machine Learning           40142
Network Security           39932
Data Science               39870
Artificial Intelligence    39764
Name: specialization, dtype: int64

In [11]:
#Exploring any relationship between the field of study and the average age of the students
df['DOB'] = pd.to_datetime(df['DOB'])

#Approximate the age as the difference between the current year and the birth year
current_year = pd.to_datetime('today').year
df['age'] = current_year - df['DOB'].dt.year

df.groupby('field_of_study')['age'].mean()

field_of_study
Chemical Engineering      21.585157
Civil Engineering         21.582240
Computer Science          21.570320
Electrical Engineering    21.585610
Mechanical Engineering    21.590571
Name: age, dtype: float64
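
Subtracting birth years overstates the age of anyone whose birthday has not yet occurred in the current year. If completed years are needed, one possible approach is sketched below. The helper `exact_age` and the fixed reference date are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def exact_age(dob: pd.Series, as_of: pd.Timestamp) -> pd.Series:
    """Age in completed years on the date `as_of` (hypothetical helper)."""
    # True where the birthday has already occurred in the as_of year
    had_birthday = (dob.dt.month < as_of.month) | (
        (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
    )
    # Subtract one year where the birthday is still to come
    return as_of.year - dob.dt.year - (~had_birthday).astype(int)

# Hypothetical dates of birth and a fixed reference date for reproducibility
dob = pd.to_datetime(pd.Series(["2000-06-15", "2000-01-10", "1999-12-31"]))
ages = exact_age(dob, pd.Timestamp("2023-03-01"))
print(ages.tolist())
```

Using a fixed reference date rather than `pd.to_datetime('today')` keeps the result stable between runs.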

In [12]:
#Students admitted each year
admission_counts = df['year_of_admission'].value_counts().sort_index()
print(admission_counts)

2016    28646
2017    28760
2018    28435
2019    28618
2020    28355
2021    28483
2022    28703
Name: year_of_admission, dtype: int64


In [13]:
#Current semester distribution among the students
semester_counts = df['current_sem'].value_counts().sort_index()

In [14]:
#Exploring the correlation between the field of study and the fees paid
#Take an explicit copy so that the assignment below does not trigger SettingWithCopyWarning
selected_columns = df[['field_of_study','fees']].copy()

#Converting the values in 'field_of_study' to numerical values
field_map = {'Chemical Engineering' : 0, 'Civil Engineering' : 1, 'Computer Science' : 2, 'Electrical Engineering' : 3, 'Mechanical Engineering' : 4}
selected_columns['field_of_study'] = selected_columns['field_of_study'].map(field_map)

corr_matrix = selected_columns.corr()


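Based on the correlation coefficient of -0.0017, we can conclude that there is no notable linear relationship between 'field_of_study' and 'fees' in the dataset. Because the fields of study are unordered categories, the coefficient depends on the arbitrary integer codes assigned to them and should be read only as a rough check.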
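
Since the integer codes given to the fields of study are arbitrary, a per-field summary of fees may be a more direct way to look for differences. The following is a minimal sketch on hypothetical toy data (`toy` and `fee_summary` are illustrative names). It is an alternative check, not the method used above.

```python
import pandas as pd

# Hypothetical toy data standing in for df[['field_of_study', 'fees']]
toy = pd.DataFrame({
    "field_of_study": ["Computer Science", "Computer Science",
                       "Civil Engineering", "Civil Engineering"],
    "fees": [100000, 120000, 90000, 130000],
})

# Summarize fees within each field; similar means and medians across
# fields would suggest fees do not vary much by field of study
fee_summary = toy.groupby("field_of_study")["fees"].agg(["mean", "median", "count"])
print(fee_summary)
```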

In [15]:
#Exploring the correlation between fees and the discount on the fees
correlation_matrix = df[['fees','discount']].corr()

The correlation coefficient of 0.49 suggests a moderate positive correlation between the fees and the discount offered on the fees.
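
The single coefficient can also be read directly with `Series.corr`, and the discount can be expressed as a fraction of the fee to see how it scales. This is a minimal sketch on hypothetical toy values. The `discount_rate` column is an illustrative addition and does not exist in the dataset.

```python
import pandas as pd

# Hypothetical toy values standing in for df[['fees', 'discount']]
toy = pd.DataFrame({
    "fees": [50000, 100000, 150000, 200000],
    "discount": [5000, 8000, 20000, 15000],
})

# Pearson correlation between the two columns, computed two ways
r = toy["fees"].corr(toy["discount"])
r_from_matrix = toy[["fees", "discount"]].corr().loc["fees", "discount"]

# Discount as a fraction of the fee (illustrative derived column)
toy["discount_rate"] = toy["discount"] / toy["fees"]
print(r, r_from_matrix)
print(toy)
```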