# EDA

*Exploratory Data Analysis (EDA) is an approach to data analysis that employs a variety of techniques, mostly graphical, to explore a dataset.*

<img src ="https://www.statistika.co/images/services/Exploratory%20Data%20Analysis%20-%20EDA%201000x468.jpg"/>

We will perform EDA on two datasets: the **Titanic Dataset** and the **Student Performance in Exams Dataset**.

# Import Libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Titanic Data

In [None]:
train = pd.read_csv('/kaggle/input/titanic/train.csv')

In [None]:
train.head()

# Exploratory Data Analysis

**Checking for null values**

In [None]:
train.isnull().sum()

In [None]:
sns.heatmap(train.isnull(), yticklabels = False, cbar=False, cmap='viridis')



Roughly 20 percent of the Age data is missing. That proportion is likely small enough for reasonable replacement with some form of imputation. The Cabin column, by contrast, is missing too much data to be useful at a basic level. We'll probably drop it later, or change it to another feature like "Cabin Known: 1 or 0".
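To put numbers on what the heatmap shows, the share of missing values per column can be computed directly, and the "Cabin Known" idea can be written as an indicator column. The following is a minimal sketch on a small made-up frame; the values and the `CabinKnown` column name are illustrative assumptions, not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; values are made up for illustration.
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

# 1 if a cabin is recorded, 0 otherwise ("CabinKnown" is a hypothetical name).
toy['CabinKnown'] = toy['Cabin'].notnull().astype(int)

print(missing_share)
print(toy['CabinKnown'].tolist())
```

On the real data, the same two lines would be applied to `train` instead of `toy`.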

Let's continue by visualizing some more of the data! Check out the video for full explanations of these plots; this code is just for reference.


In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', data=train)

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=train, palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Pclass', data=train, palette='rainbow')

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(train['Age'].dropna(), kde=False, color='darkred', bins=40)

In [None]:
train['Age'].hist(bins=30, color='darkred', alpha=0.3)

In [None]:
sns.countplot(x='SibSp', data=train)

In [None]:
train['Fare'].hist(bins=40, color='green', figsize=(8,4))

# Data Cleaning

**We want to fill in the missing age data instead of just dropping the rows where age is missing. One way to do this is to fill in the mean age of all the passengers (imputation). However, we can be smarter about this and check the average age by passenger class. For example:**

In [None]:
plt.figure(figsize = (12,7))
sns.boxplot(x='Pclass', y='Age', data=train, palette='winter')

**We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these average age values to impute based on Pclass for Age.**

In [None]:
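def impute_age(cols):
    # Look up values by column label; positional access such as cols[0]
    # is deprecated for label-indexed Series in recent pandas versions
    Age = cols['Age']
    Pclass = cols['Pclass']

    # Fill a missing age with the typical age for the passenger's class
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age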

**Now apply that function!**

In [None]:
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)
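As an alternative to hard-coding the per-class ages, the fill values can be derived from the data itself. This is a sketch of a different technique (group-wise median imputation with `groupby(...).transform`), shown on a made-up frame rather than the Titanic data, so the numbers below are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing ages; values are made up for illustration.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age':    [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age within each passenger class, aligned back to every row.
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Fill only the missing ages with the median of their class.
toy['Age'] = toy['Age'].fillna(class_median)

print(toy)
```

This avoids a row-by-row `apply` and keeps the fill values in sync with the data if the dataset changes.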

**Now let's check that heat map again!**

In [None]:
sns.heatmap(train.isnull(), yticklabels = False, cbar=False, cmap='viridis')

**Great! Let's go ahead and drop the Cabin column and the rows where Embarked is NaN.**

In [None]:
train.drop('Cabin', axis=1, inplace=True)
train.dropna(inplace=True)  # drop the remaining rows with missing values (Embarked)

In [None]:
sns.heatmap(train.isnull(), yticklabels = False, cbar=False, cmap='viridis')

# Converting Categorical Features

**We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.**
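To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up `Embarked` column (the four values are illustrative). The `dtype=int` argument is an addition for readability, so the indicators print as 0/1 rather than booleans.

```python
import pandas as pd

# Toy column with the three embarkation codes; values are made up.
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One indicator column per category.
full = pd.get_dummies(toy['Embarked'], dtype=int)

# drop_first=True removes the first category ('C'); it is implied
# whenever all remaining indicators are 0.
reduced = pd.get_dummies(toy['Embarked'], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Dropping one level avoids feeding the model a set of columns that always sum to 1.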

In [None]:
train.info()

In [None]:
pd.get_dummies(train['Embarked'], drop_first=True).head()

In [None]:
embark = pd.get_dummies(train['Embarked'], drop_first = True)
sex = pd.get_dummies(train['Sex'], drop_first = True)

In [None]:
train.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis = 1, inplace = True)

In [None]:
train.head()

In [None]:
train = pd.concat([train, sex, embark], axis = 1)

In [None]:
train.head()

# Applying CatBoostClassifier

In [None]:
from catboost import CatBoostClassifier

In [None]:
# Separate the features and the target
X = train.drop(['Survived'], axis=1)
Y = train['Survived']

# Use 80% of the data for training and hold out the rest for testing
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, train_size=0.8, random_state=42)
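As a quick check of what `train_size=0.8` does, the sketch below splits a made-up array of 10 rows and inspects the resulting shapes. The data and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up samples with 2 features each, and alternating labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# 80% of the rows go to the training set, the remaining 20% to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible between runs.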

In [None]:
clf = CatBoostClassifier(eval_metric='Accuracy', use_best_model=True, random_seed=42)

# Fit the model; the held-out set is used for evaluation and early stopping
clf.fit(xtrain, ytrain, eval_set=(xtest, ytest), early_stopping_rounds=50)

In [None]:
# Accuracy on the held-out set
clf.score(xtest, ytest)

# Students Performance in Exams


In [None]:
# Import Student Data 
data = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')

In [None]:
# Show the first 5 rows
data.head()

**Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution.**
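For intuition about which numbers map to which idea, here is a minimal sketch on a made-up series of five scores. `describe()` covers central tendency and dispersion; `skew()` is added here as one possible measure of shape.

```python
import pandas as pd

# Five made-up, evenly spaced scores.
scores = pd.Series([50, 60, 70, 80, 90], name='score')

summary = scores.describe()

# Central tendency: mean and median (the 50% quantile).
print(summary['mean'], summary['50%'])

# Dispersion: sample standard deviation and the min/max range.
print(summary['std'], summary['min'], summary['max'])

# Shape: skewness is 0 for a perfectly symmetric sample like this one.
print(scores.skew())
```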

In [None]:
data.describe()

In [None]:
data.columns

**Counting the unique values in each column (`nunique` = number of unique values)**

In [None]:
data.nunique()

**Listing the unique values in the gender column**

In [None]:
data['gender'].unique()

# Cleaning the data

In [None]:
#Finding the number of null values 
data.isnull().sum()

In [None]:
# Remove the irrelevant columns
student = data.drop(['race/ethnicity','parental level of education'], axis=1)

In [None]:
student.head()

# Relationship Analysis

In [None]:
# Finding the correlation of student data
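# numeric_only=True skips the text columns (gender, lunch, ...), which
# would otherwise raise an error in recent pandas versions
correlation = student.corr(numeric_only=True)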

In [None]:
# Visualizing the correlation of the student data
sns.heatmap(correlation, xticklabels = correlation.columns, yticklabels = correlation.columns, annot = True)
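To see how to read the values in such a matrix, here is a small sketch on a made-up frame where the relationships are known in advance. The column names and numbers are illustrative, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

# Made-up data with known linear relationships.
toy = pd.DataFrame({
    'gender':  ['female', 'male', 'female', 'male'],
    'math':    [50, 60, 70, 80],
    'reading': [55, 65, 75, 85],  # math + 5: rises exactly with math
    'errors':  [8, 6, 4, 2],      # falls linearly as math rises
})

# Pearson correlation between the numeric columns only.
corr = toy.corr(numeric_only=True)

print(corr)
```

A value near 1 means two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.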

**`pairplot` plots pairwise relationships in a dataset. By default, this function creates a grid of Axes such that each numeric variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal Axes are treated differently: they show the univariate distribution of the variable in that column.**

In [None]:
sns.pairplot(student)

**`relplot` shows the relationship between two variables.**

In [None]:
sns.relplot(x = 'math score', y = 'reading score', hue = 'gender', data = student)

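**`histplot` draws a histogram and, with `kde=True`, overlays a kernel density estimate (KDE). It replaces `distplot`, which is deprecated in recent seaborn releases.**

In [None]:
sns.histplot(student['math score'], kde=True)

In [None]:
sns.histplot(student['writing score'], kde=True)

In [None]:
sns.histplot(student['writing score'], bins=5, kde=True)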

**In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles.**

In [None]:
sns.catplot(x='math score', kind='box', data = student)
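The quartiles that a box plot draws can also be computed directly. The sketch below uses a made-up series of scores; the 1.5 × IQR fences correspond to seaborn's default whisker setting.

```python
import pandas as pd

# Seven made-up, evenly spaced scores.
scores = pd.Series([40, 50, 60, 70, 80, 90, 100])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers by default.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
```

On the student data, the same calculation could be applied to `student['math score']` to find which scores the plot marks as outliers.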