**Step - 1** 
# Introduction 
 - Give a detailed data description and objective

The dataset was released by Aspiring Minds from the Aspiring Mind Employment Outcome 2015 (AMEO) study, which is limited to students from engineering disciplines. It contains the employment outcomes of engineering graduates as dependent variables (Salary, Job Titles, and Job Locations), along with standardized scores in three areas: cognitive skills, technical skills and personality skills. It also contains demographic features. In total there are around 40 independent variables, both continuous and categorical, and 4000 data points, each with a unique identifier for the candidate. The table below describes the variables in the original dataset.

### Variables, Types and Descriptions

| Variable | Type | Description |
|---|---|---|
| **ID** | UID | A unique ID to identify a candidate |
| **Salary** | Continuous | Annual CTC offered to the candidate (in INR) |
| **DOJ** | Date | Date of joining the company |
| **DOL** | Date | Date of leaving the company |
| **Designation** | Categorical | Designation offered in the job |
| **JobCity** | Categorical | Location of the job (city) |
| **Gender** | Categorical | Candidate’s gender |
| **DOB** | Date | Date of birth of candidate |
| **10percentage** | Continuous | Overall marks obtained in grade 10 examinations |
| **10board** | Categorical | The school board whose curriculum the candidate followed in grade 10 |
| **12graduation** | Date | Year of graduation - senior year high school |
| **12percentage** | Continuous | Overall marks obtained in grade 12 examinations |
| **12board** | Categorical | The school board whose curriculum the candidate followed in grade 12 |
| **CollegeID** | NA/ID | Unique ID identifying the college which the candidate attended |
| **CollegeTier** | Categorical | Tier of college |
| **Degree** | Categorical | Degree obtained/pursued by the candidate |
| **Specialization** | Categorical | Specialization pursued by the candidate |
| **CollegeGPA** | Continuous | Aggregate GPA at graduation |
| **CollegeCityID** | NA/ID | A unique ID to identify the city in which the college is located |
| **CollegeCityTier** | Categorical | The tier of the city in which the college is located |
| **CollegeState** | Categorical | Name of the state in which the college is located |
| **GraduationYear** | Date | Year of graduation (Bachelor’s degree) |
| **English** | Continuous | Scores in AMCAT English section |
| **Logical** | Continuous | Scores in AMCAT Logical section |
| **Quant** | Continuous | Scores in AMCAT Quantitative section |
| **Domain** | Continuous/Standardized | Scores in AMCAT’s domain module |
| **ComputerProgramming** | Continuous | Score in AMCAT’s Computer Programming section |
| **ElectronicsAndSemicon** | Continuous | Score in AMCAT’s Electronics & Semiconductor Engineering section |
| **ComputerScience** | Continuous | Score in AMCAT’s Computer Science section |
| **MechanicalEngg** | Continuous | Score in AMCAT’s Mechanical Engineering section |
| **ElectricalEngg** | Continuous | Score in AMCAT’s Electrical Engineering section |
| **TelecomEngg** | Continuous | Score in AMCAT’s Telecommunication Engineering section |
| **CivilEngg** | Continuous | Score in AMCAT’s Civil Engineering section |
| **conscientiousness** | Continuous/Standardized | Scores in one of the sections of AMCAT’s personality test |
| **agreeableness** | Continuous/Standardized | Scores in one of the sections of AMCAT’s personality test |
| **extraversion** | Continuous/Standardized | Scores in one of the sections of AMCAT’s personality test |
| **neuroticism** | Continuous/Standardized | Scores in one of the sections of AMCAT’s personality test |
| **openess_to_experience** | Continuous/Standardized | Scores in one of the sections of AMCAT’s personality test |

**Step - 2** 
- Import the data and display the head, shape and description of the data.

In [None]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
import random
from math import sqrt
from scipy.stats import t,norm,chi2,chi2_contingency

pd.set_option('display.max_columns',None)      # For Showing all columns
warnings.filterwarnings("ignore")              # For removing the warnings [Not recommended!!!]
pd.set_option('display.float_format',lambda x: '%.3f' %x)
%matplotlib inline

In [None]:
df=pd.read_excel("aspiring_minds_employability_outcomes_2015.xlsx",index_col='ID')
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe(include='all')

In [None]:
df.isna().sum()

In [None]:
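# 'present' marks candidates who are still employed; substitute the current
# timestamp so the whole column can be parsed as datetimes.
# Series.replace matches whole values, so existing date entries are left untouched.
df['DOL']=df['DOL'].replace('present',dt.now())
df['DOL']=pd.to_datetime(df['DOL'])
#df.head()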

In [None]:
df.drop(columns="Unnamed: 0",inplace=True)
df.head(1)

**Step - 3** 
# Univariate Analysis 
- PDF, Histograms, Boxplots, Countplots, etc.
-	Find the outliers in each numerical column
-	Understand the probability and frequency distribution of each numerical column
-	Understand the frequency distribution of each categorical Variable/Column
-	Mention observations after each plot.
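
One common way to flag outliers in a numerical column is the 1.5 × IQR rule that boxplots use for their whiskers. The sketch below is only an illustration on made-up values: `iqr_outliers` and `toy_salary` are hypothetical names, not part of the notebook, and the multiplier `k=1.5` is a convention rather than a requirement.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Hypothetical salaries (INR); the last value is deliberately extreme.
toy_salary = pd.Series([250000, 300000, 320000, 280000, 310000, 295000, 4000000])

outliers = iqr_outliers(toy_salary)
print(outliers.tolist())
```

On the real data the same helper could be applied column by column, for example `iqr_outliers(df['Salary'])`, and the count of flagged rows reported next to each boxplot.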

In [None]:
sns.distplot(df.Salary, kde_kws={"color": "k"})

In [None]:
"Average salary of all candidates is {:.2f}".format(df.Salary.mean())

In [None]:
sns.distplot(df['10percentage'],
                  kde_kws={"color": "k","label": "KDE"}
                  )

In [None]:
plt.figure(figsize=(15,15))
sns.distplot(df['12graduation'])
plt.xticks(rotation=24)
plt.show()

In [None]:
sns.distplot(df['12percentage'])

In [None]:
sns.distplot(df['collegeGPA'])

In [None]:
sns.distplot(df['English'])

In [None]:
sns.distplot(df['Logical'])

In [None]:
sns.distplot(df['Quant'])

In [None]:
sns.distplot(df['Domain'])

In [None]:
sns.distplot(df['ComputerProgramming'])

In [None]:
sns.distplot(df['ElectronicsAndSemicon'])

In [None]:
sns.distplot(df['ComputerScience'])

In [None]:
sns.distplot(df['MechanicalEngg'])

In [None]:
sns.distplot(df['ElectricalEngg'])

In [None]:
sns.distplot(df['TelecomEngg'])

In [None]:
sns.distplot(df['CivilEngg'])

In [None]:
sns.distplot(df['conscientiousness'])

In [None]:
sns.distplot(df['agreeableness'])

In [None]:
sns.distplot(df['extraversion'])

In [None]:
sns.distplot(df['nueroticism'])

In [None]:
sns.distplot(df['openess_to_experience'])

In [None]:
specialization_freq = df['Specialization'].value_counts()[:15]
specialization_freq.plot(kind='bar', figsize=(15,5))

In [None]:
plt.figure(figsize=(20,10))
# Pass the column explicitly as x=; show the 9 most frequent designations
sns.countplot(x=df['Designation'],order=df['Designation'].value_counts().iloc[:9].index)

In [None]:
plt.figure(figsize=(20,10))
# Pass the column explicitly as x=; show the 9 most frequent job cities
sns.countplot(x=df['JobCity'],order=df['JobCity'].value_counts().iloc[:9].index)

In [None]:
# Fix the bar order so the tick labels below are guaranteed to match the bars
sns.countplot(x=df['Gender'],order=['f','m'])
plt.xticks(ticks=[0,1],labels=["Female","Male"])

In [None]:
"Female percentage in the dataset {}".format( round((df.Gender[df.Gender=='f'].count()/len(df.Gender))*100,2))

In [None]:
plt.figure(figsize=(20,10))
graph=sns.countplot(x=df['10board'],order=df['10board'].value_counts().iloc[:9].index)
graph.set_xticklabels(list(df['10board'].value_counts().iloc[:9].index),fontdict={'fontsize':'14'})
plt.show()

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x=df['12board'],order=df['12board'].value_counts().iloc[:9].index)

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x=df['Degree'])

**Step - 4** 
# Bivariate Analysis
-	Discover the relationships between numerical columns using scatter plots, hexbin plots, pair plots, etc.
-	Identify the patterns between categorical and numerical columns using swarmplot, boxplot, barplot, etc.
-	Mention observations after each plot.
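
Plots are easier to comment on when they are paired with a number. As a rough sketch on made-up values (the `toy` frame below is hypothetical and only reuses the notebook's column names), a Pearson correlation summarizes a numerical pair, and a grouped summary does the same for a categorical and numerical pair:

```python
import pandas as pd

# Hypothetical data that mimics three of the notebook's columns.
toy = pd.DataFrame({
    "collegeGPA": [60.0, 65.0, 70.0, 75.0, 80.0, 85.0],
    "Salary": [200000, 240000, 260000, 310000, 330000, 400000],
    "Gender": ["f", "m", "f", "m", "f", "m"],
})

# Numerical vs numerical: Pearson correlation coefficient.
r = toy["collegeGPA"].corr(toy["Salary"])

# Categorical vs numerical: per-group count, median and mean.
salary_by_gender = toy.groupby("Gender")["Salary"].agg(["count", "median", "mean"])

print(round(r, 3))
print(salary_by_gender)
```

The median is included next to the mean because salary distributions are often skewed, in which case the two can tell different stories.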

In [None]:
sns.scatterplot(y = df['Salary'], x = df['10percentage'], hue = df['Gender'])

In [None]:
sns.scatterplot(y = df['Salary'], x = df['12percentage'], hue = df['Gender'])

In [None]:
sns.scatterplot(y = df['Salary'], x = df['collegeGPA'], hue = df['Gender'])

In [None]:
d=df.groupby('Specialization').agg({'Salary':sum})
d.nlargest(10,['Salary']).index

In [None]:
plt.figure(figsize=(20,10))
graph=sns.barplot(x='Specialization',y='Salary',data=df,order=d.nlargest(10,['Salary']).index)
plt.xticks(rotation=90)
plt.show()

**Step - 5** 
# Research Questions
5.1 Times of India article dated Jan 18, 2019 states that “After doing your Computer Science Engineering if you take up jobs as a Programming Analyst, Software Engineer, Hardware Engineer and Associate Engineer you can earn up to 2.5-3 lakhs as a fresh graduate.” Test this claim with the data given to you.


In [None]:
# Select the relevant columns and drop the ID index in one step;
# this yields a new DataFrame, so later filtering does not touch df
data=df[['Specialization','GraduationYear','DOJ','Designation','Salary']].reset_index(drop=True)
print(data.shape)
data.head()


In [None]:
data.info()

In [None]:
# Filtering Based on Requirement

# Fresher i.e., where Graduation Year and Date of Joining is Same
data=data[data.GraduationYear==data.DOJ.dt.year]

# where the Specialization is 'computer science & engineering'
data=data[data.Specialization=='computer science & engineering']

# Designation is 'programmer analyst','software engineer','associate engineer','hardware engineer'
final_data=data[data.Designation.isin(['programmer analyst','software engineer','associate engineer','hardware engineer'])]

final_data

In [None]:
pop_mean=np.mean(final_data.Salary)
print(pop_mean)

**Step - 1** :

- Alternate hypothesis (Bold Claim): fresh graduates cannot earn up to 2.75 lakhs
  $$ H_1 : \mu < 275000 $$

- Null hypothesis (Status Quo): fresh graduates earn up to 2.5-3 lakhs [taking the midpoint, 2.75 lakhs]
  $$ H_0 : \mu \geq 275000 $$

**Step 2** :
- Collect a sample of size $n$
- Compute the sample mean: $ \bar x $ = ?

In [None]:
n=20 # Sample Size
sample_salary=random.sample(list(final_data.Salary),n)
print(sample_salary)
sample_mean=np.mean(sample_salary)
print(sample_mean)
# Sample standard deviation (n-1 in the denominator), as required by the t-test
sample_stddev=np.std(sample_salary,ddof=1)
print(sample_stddev)


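**Step 3**:
 Compute the t-statistic (as the population standard deviation is not given), where $\mu_0 = 275000$ is the mean hypothesized under $H_0$, $s$ is the sample standard deviation and $n$ is the sample size:
 $$ t = \frac {\bar x - \mu_0}{\frac{s}{\sqrt n}} $$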
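
As a cross-check, the statistic $(\bar x - \mu_0)/(s/\sqrt n)$ can be computed by hand and compared with `scipy.stats.ttest_1samp`. The sketch below uses a made-up sample (the values in `sample` are hypothetical, not taken from the dataset) and assumes a SciPy version that supports the `alternative` argument.

```python
import numpy as np
from scipy import stats

mu_0 = 275000  # mean hypothesized under H0 (midpoint of 2.5-3 lakhs)

# Hypothetical sample of fresher salaries (INR), for illustration only.
sample = np.array([240000, 260000, 300000, 250000, 270000,
                   230000, 280000, 255000, 265000, 245000])

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (n-1 denominator)

# Manual t-statistic.
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# Library version; alternative="less" matches H1: mu < mu_0.
result = stats.ttest_1samp(sample, popmean=mu_0, alternative="less")

print(t_manual, result.statistic, result.pvalue)
```

If the two statistics disagree, the usual culprit is the standard deviation: `np.std` defaults to `ddof=0`, whereas the t-test uses `ddof=1`.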

In [None]:
# Calculating t-score against the mean hypothesized under H0 (2.75 lakhs)
mu_0=275000
t_score= ((sample_mean -mu_0) / (sample_stddev/(sqrt(n))))
print(t_score)

**Step 4**:
Decide $\alpha$ (the significance level) and the degrees of freedom

In [None]:
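# T Critical or T-Tabulated
# Left-tailed test (H1: mu < 275000): t_critical is the positive one-sided
# critical value, so H0 is rejected when t_score < -t_critical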
confidence_level=0.95
deg_of_freedom=n-1
alpha=1-confidence_level
t_critical=t.ppf(1-alpha,deg_of_freedom)
t_critical

**Step 5** : 

5.1 Apply Decision Rule (here $t_{n-1,\alpha}$ denotes the positive critical value with upper-tail area $\alpha$):
- For T-test:
  - Two Tail t-test:
  $$|t| \ > \ t_{n-1,\frac{\alpha}{2}} \Rightarrow Accept \ H_1 \ or \ Reject \ H_0$$
  - Right Tail t-test:
  $$t \ > \ t_{n-1,\alpha} \Rightarrow Accept \ H_1 \ or \ Reject \ H_0$$
  - Left Tail t-test:
  $$t \ < \ -t_{n-1,\alpha} \Rightarrow Accept \ H_1 \ or \ Reject \ H_0$$

In [None]:
#plotting the sample distribution with rejection region

x_min = 220000
x_max = 490000

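mean = 275000                    # mean hypothesized under H0
std = sample_stddev / (n**0.5)   # standard error of the sample mean

x = np.linspace(x_min, x_max, 100)
y = norm.pdf(x, mean, std)       # normal curve used as an approximation

plt.xlim(x_min, x_max)
# plt.ylim(0, 0.03)

plt.plot(x, y)

# Left-tail critical value expressed on the salary scale
t_critical_left = mean + (-t_critical * std)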

x1 = np.linspace(x_min, t_critical_left, 100)
y1 = norm.pdf(x1, mean, std)
plt.fill_between(x1, y1, color='orange')

plt.scatter(sample_mean, 0)
# plt.annotate("X_bar", (sample_stddev, 0.02))




5.2 Compute p-Value $P(Test \ Statistic \mid H_0)$
- For Two-Tailed Test:
  $$p \ value = 2 * (1- cdf(|Test \ Statistic|))$$
- For Right-Tailed Test:
  $$p \ value = 1- cdf(Test \ Statistic)$$
- For Left-Tailed Test:
  $$p \ value = cdf(Test \ Statistic)$$
  
- Now,
$$ if(p \ value \ < \ \alpha) \Rightarrow Accept \ H_1 \ or \ Reject \ H_0 $$
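
The p-value depends on which tail the alternative hypothesis points to. A minimal sketch, assuming the statistic follows a t-distribution with `dof` degrees of freedom (the helper name `p_value_from_t` is hypothetical):

```python
from scipy.stats import t


def p_value_from_t(t_stat: float, dof: int, tail: str = "two-sided") -> float:
    """p-value of a t-statistic for a left, right or two-sided alternative."""
    if tail == "left":        # H1: mu < mu_0
        return float(t.cdf(t_stat, dof))
    if tail == "right":       # H1: mu > mu_0
        return float(t.sf(t_stat, dof))   # sf = 1 - cdf
    if tail == "two-sided":   # H1: mu != mu_0
        return float(2 * t.sf(abs(t_stat), dof))
    raise ValueError("tail must be 'left', 'right' or 'two-sided'")


# Example: t = -2.0 with 19 degrees of freedom (n = 20), left-tailed test.
p_left = p_value_from_t(-2.0, 19, tail="left")
print(p_left)
```

Taking the absolute value of the statistic is only appropriate in the two-sided case; in a one-sided test it would hide a statistic that falls in the wrong tail.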

In [None]:
if(t_score < -t_critical):
    print("Conclusion:Reject Null Hypothesis")
else:
    print("Conclusion:Fail to reject Null Hypothesis")

In [None]:
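# Left-tailed test: the p-value is the t-distribution CDF (n-1 degrees of freedom)
# evaluated at the observed t-score
p_value = t.cdf(t_score, deg_of_freedom)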

print("p_value = ", p_value)

if(p_value < alpha):
    print("Conclusion:Reject Null Hypothesis")
else:
    print("Conclusion: Fail to reject Null Hypothesis")

# 5.2 Is there a relationship between gender and specialisation? (i.e. Does the preference of Specialisation depend on the Gender?)

**Step - 1:**

- Alternate Hypothesis: $$ H_1: Gender \ and \ Specialization \ are \ dependent $$
- Null Hypothesis: $$ H_0: Gender \ and \ Specialization \ are \ independent $$


**Step - 2:**
- Collect the sample of size n
- Compute the sample frequencies
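
Under the null hypothesis of independence, the expected frequency of each cell is (row total × column total) / grand total, and the test statistic sums (observed − expected)² / expected over all cells. The sketch below checks this by hand against `chi2_contingency` on a made-up 3×2 table; `toy_observed` is hypothetical, and `correction=False` is passed so that no continuity correction is applied.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are specializations, columns are genders.
toy_observed = np.array([[30, 70],
                         [20, 80],
                         [10, 90]])

row_totals = toy_observed.sum(axis=1, keepdims=True)
col_totals = toy_observed.sum(axis=0, keepdims=True)
grand_total = toy_observed.sum()

# Expected counts if the two variables were independent.
expected = row_totals * col_totals / grand_total

# Chi-square statistic and degrees of freedom computed by hand.
chi2_manual = ((toy_observed - expected) ** 2 / expected).sum()
dof_manual = (toy_observed.shape[0] - 1) * (toy_observed.shape[1] - 1)

# Library result for comparison.
chi2_stat, p_value, dof, expected_scipy = chi2_contingency(toy_observed, correction=False)

print(chi2_manual, chi2_stat, dof, p_value)
```

The chi-square approximation is usually considered reliable only when expected counts are not too small, which is worth checking for rare specializations.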

In [None]:
pd.crosstab(df['Specialization'],df['Gender'],margins=True)

In [None]:
observed = pd.crosstab(df.Specialization, df.Gender)

observed

In [None]:
# chi2_contingency returns the chi2 test statistic, p-value, degrees of freedom and expected frequencies
chi2_contingency(observed)

In [None]:
# Computing chi2 test statistic, p-value, degrees of freedom (one call, unpacked)

chi2_test_stat, pval, df2, expected_freq = chi2_contingency(observed)
confidence_level = 0.90

alpha = 1 - confidence_level

chi2_critical = chi2.ppf(1 - alpha, df2)

chi2_critical

In [None]:
# Plotting the chi2 distribution to visualise

# Defining the x minimum and x maximum
x_min = 0
x_max = 100

# Plotting the graph and setting the x limits
x = np.linspace(x_min, x_max, 100)
y = chi2.pdf(x, df2)
plt.xlim(x_min, x_max)
plt.plot(x, y)


# Setting Chi2 Critical value 
chi2_critical_right = chi2_critical

# Shading the right rejection region
x1 = np.linspace(chi2_critical_right, x_max, 100)
y1 = chi2.pdf(x1, df2)
plt.fill_between(x1, y1, color='orange')

In [None]:
if(chi2_test_stat > chi2_critical):
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

In [None]:
if(pval < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

**Step - 6** 
# Conclusion

**Step - 7** 
- We performed EDA on the Aspiring Mind Employment Outcome 2015 (AMEO) dataset.
- We carried out univariate and bivariate analysis and noted some interesting observations.
- Most candidates received a starting package of about Rs 300,000.
- Only ~24% of the candidates in the dataset are female, so more women need to be encouraged to pursue an engineering degree.
- The mean salary in other departments is higher than that of candidates specializing in CS and EC.
- We ran a hypothesis test on the Times of India article's claim. The test did not reject the claim that “After doing your Computer Science Engineering if you take up jobs as a Programming Analyst, Software Engineer, Hardware Engineer and Associate Engineer you can earn up to 2.5-3 lakhs as a fresh graduate.”
- We ran a second hypothesis test to check the dependency between Gender and Specialization. The test indicated that the two features are dependent.

# (Bonus) Come up with some interesting conclusions or research questions.
- Perform feature transformation:
    -	For Numerical Features -> Do Column Standardization
    -	For Categorical -> if more than 2 categories, use dummy variables. Otherwise convert the feature to Binary

In [None]:
categorical_features= ['Specialization', 'CollegeState', 'Gender', 'Degree','12board','10board','CollegeTier','CollegeCityTier']
for i in categorical_features:
    unique = len(df[i].unique())
    print("{}: {}".format(i, unique))

Since four of these columns have a large number of unique values, we should trim each of them to fewer than 10 categories for better results.
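
One possible way to trim a high-cardinality column is to keep its most frequent categories and merge the rest into a single bucket before creating dummy variables. This is a sketch on made-up values: `keep_top_categories`, `toy_board` and the `"other"` label are hypothetical choices, not something the notebook prescribes.

```python
import pandas as pd


def keep_top_categories(series: pd.Series, top_n: int = 9,
                        other_label: str = "other") -> pd.Series:
    """Keep the top_n most frequent categories; replace the rest with other_label."""
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other_label)


# Hypothetical school-board values, for illustration only.
toy_board = pd.Series(["cbse", "cbse", "state board", "icse",
                       "cbse", "state board", "up board", "matric"])

trimmed = keep_top_categories(toy_board, top_n=2)
dummies = pd.get_dummies(trimmed, prefix="10board")

print(trimmed.tolist())
print(list(dummies.columns))
```

With `top_n=9` plus the merged bucket, each column ends up with at most 10 categories, which keeps the number of dummy columns manageable.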

In [None]:
df = pd.get_dummies(df, columns = categorical_features )

**Numerical Column Transformation**
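
Column standardization rescales each numerical column to zero mean and unit variance, $z = (x - \bar x)/\sigma$. The sketch below does this by hand with pandas on made-up values (`toy` is hypothetical) and uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows.

```python
import numpy as np
import pandas as pd

# Hypothetical values that reuse two of the notebook's column names.
toy = pd.DataFrame({
    "Salary": [200000.0, 300000.0, 400000.0, 500000.0],
    "10percentage": [70.0, 80.0, 85.0, 95.0],
})
numeric_cols = ["Salary", "10percentage"]

# z = (x - mean) / std, applied column by column.
means = toy[numeric_cols].mean()
stds = toy[numeric_cols].std(ddof=0)
standardized = (toy[numeric_cols] - means) / stds

print(standardized.round(3))
```

Each column is scaled with its own mean and standard deviation, so columns measured in very different units (rupees versus percentages) end up on a comparable scale.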

In [None]:
from sklearn.preprocessing import StandardScaler
s=StandardScaler()
# Standardize the numerical columns in one call (each column is scaled independently)
num_cols=['Salary','10percentage','12percentage']
df[num_cols]=s.fit_transform(df[num_cols])

In [None]:
df.head()