## EXPLORATORY DATA ANALYSIS (EDA)
#### EDA is an important step in the data analysis process. It involves examining the data to understand its structure, detect patterns, spot anomalies, test hypotheses, and check assumptions. It also helps summarize the main characteristics of the data, often using visual methods.

#### Here's an example of how EDA works:



In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("/Users/ambigaur/Desktop/Ambi_coding/python/Prasunet_DS/Titanic_train.csv")
df.sample(10)

#### 1. Statistical Summary

In [None]:
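df.info()      # column names, dtypes, and non-null counts (printed directly)
df.describe()  # count, mean, std, min, quartiles, and max of numeric columns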

#### 2. Understanding the dataset and data distribution
##### After checking sample rows, summary statistics, and data types, we plot a histogram of each numeric column to see how its values are distributed

In [None]:
df.hist(bins=20, figsize=(20,15))

plt.show()

#### 3. Graphical Analysis

##### a. Categorical Data Analysis

In [None]:
plt.figure(figsize=(20,15))

plt.subplot(3,1,1)
sns.countplot(data=df, x='Survived')
plt.title('Survival Count')

plt.subplot(3,1,2)
sns.countplot(data=df, x='Pclass')
plt.title('Passenger class count')

plt.subplot(3,1,3)
sns.countplot(data=df, x='Sex')
plt.title('Gender Count')

plt.tight_layout()
plt.show()

##### b. Bivariate Data Analysis

In [None]:
sns.barplot(data=df, x='Pclass', y='Survived')
plt.title('Survival rate by Passenger class')
plt.show()

sns.barplot(data=df, x='Sex', y='Survived')
plt.title('Survival rate by gender')
plt.show()

sns.barplot(data=df, x='Embarked', y='Survived')
plt.title('Survival rate by port of embarkation')
plt.show()
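
The bar heights in these plots are group means of `Survived`, so the same survival rates can be read off as numbers with a `groupby`. The following is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and only mirrors the column names used above):

```python
import pandas as pd

# Hypothetical toy data; only the column names match the Titanic dataset.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survived is coded 0/1, so the mean of each group is its survival rate.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```

On the real data, replacing `toy` with `df` should give the values that the bar plots display.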


In [None]:
plt.figure(figsize=(20,15))

plt.subplot(1,2,1)
sns.boxplot(data=df, x='Survived', y='Age')
plt.title('Age Distribution by Survival')

plt.subplot(1,2,2)
sns.boxplot(data=df, x='Survived', y="Fare")
plt.title('Fare Distribution by Survival')

plt.tight_layout()
plt.show()

plt.figure(figsize=(10,15))
sns.scatterplot(data=df, x='Age', y='Fare', hue='Survived')
plt.title('Age Vs Fare Coloured by Survived')
plt.show()

#### 4. Correlation Analysis

In [None]:
numeric_df = df.select_dtypes(include=[np.number])
plt.figure(figsize=(10,15))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Analysis')
plt.show()
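
A full heatmap can be hard to scan when the main question is how each feature relates to the target. As a rough sketch, assuming `Survived` is the target, the correlations can be ranked by absolute strength. The `toy` values below are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data; only the column names match the Titanic dataset.
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Pclass":   [3, 3, 1, 1, 2, 2],
    "Fare":     [7.0, 8.0, 70.0, 80.0, 30.0, 13.0],
})

# Pearson correlation of every numeric column with Survived,
# sorted so the strongest relationship (positive or negative) comes first.
corr_with_target = (
    toy.corr()["Survived"]
    .drop("Survived")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(corr_with_target)
```

Correlation only captures linear association, so a low value does not rule out a relationship.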

#### 5. Missing Value Analysis

In [None]:
plt.figure(figsize=(15,15))
# Use the full DataFrame so that non-numeric columns such as Cabin and Embarked are included
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing value analysis')
plt.show()
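
The heatmap shows where values are missing; a per-column count makes the extent explicit. Here is a minimal sketch on a made-up DataFrame (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in three columns.
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, np.nan],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare":     [7.25, 71.28, 7.92, 8.05],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)

print(missing)
```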

## Summary 
#### 1. Basic Information: Start by understanding the structure and summary statistics of the data to get an overall sense of what it contains.
#### 2. Data Distribution: Examine how numerical features are distributed to identify patterns and outliers.
#### 3. Feature Analysis: Analyze categorical features using count plots, study relationships between survival and other features, and use box plots and scatter plots to explore numerical features. This helps you see the distribution, frequency, and interactions of different variables.
#### 4. Correlation Analysis: Identify correlations between features to understand how variables are related and possibly dependent on each other.
#### 5. Missing Values Analysis: Visualize and understand the pattern of missing data to determine how to handle these gaps in your analysis or modeling.
#### Now we move on to our next step: Data Cleaning.

# Data Cleaning
#### Data cleaning is the process of detecting errors and inconsistencies in the data and then correcting them, either by removing the affected records or by applying other techniques, to improve data quality. Data cleaning is also referred to as data cleansing or data scrubbing. There are several ways to identify and rectify the different irregularities present in the data.

## 1. Removal of Duplicate data:
#### Identification and removal of duplicate entries in a dataset.
## 2. Handling missing values:
#### Identification of missing values and then processing them. We can handle missing values either by removing the affected rows or columns, or by replacing the gaps with a value such as the mean, median, or mode of the column.
## 3. Dealing with Outliers: 
#### Identifying and handling outliers, which are data points significantly different from others. Depending on the context, outliers may be corrected, removed, or kept.
## 4. Standardizing Data: 
#### Making sure that data is consistent and uniformly formatted, for example by standardizing date formats, text capitalization, and units of measurement.
## 5. Removal of Irrelevant data:
#### Identifying data that is not relevant to the analysis or decision-making process and then removing it.
## 6. Validating Data:
#### Ensuring that the data conforms to defined rules and constraints, such as data type constraints, range constraints, and unique constraints.
## 7. Fixing Inaccurate Data: 
#### Fixing any errors or inaccuracies in the data, such as misspelled words, incorrect entries, or outdated information.


#### 1. Dropping duplicates

In [None]:
df.drop_duplicates(inplace=True)
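
Before dropping duplicates it can help to see how many rows are affected. The sketch below uses a small made-up DataFrame (`toy` is hypothetical) to count fully identical rows and compare the row count before and after:

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["A", "B", "B", "C"],
})

# duplicated() marks every row that repeats an earlier row in all columns.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicate rows: {n_duplicates}")
print(f"rows before: {len(toy)}, after: {len(deduped)}")
```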

#### 2. Handling missing values

In [None]:
df['Age'].fillna(df['Age'].median(), inplace=True)
df['Embarked'].fillna(df['Embarked'].mode()[0], inplace=True)
df['Fare'].fillna(df['Fare'].median(), inplace=True)
df['Sex'].fillna(df['Sex'].mode()[0], inplace=True)
df['Survived'].fillna(df['Survived'].mode()[0], inplace=True)
df['Cabin'].fillna("Unknown", inplace=True)
df['Pclass'].fillna(df['Pclass'].mode()[0], inplace=True)
df['Parch'].fillna(df['Parch'].mode()[0], inplace=True)
df['SibSp'].fillna(df['SibSp'].mode()[0], inplace=True)

In [None]:
# Fill all columns in one call: median for continuous columns,
# mode for categorical or discrete ones, and a placeholder for Cabin
df.fillna({
    'Age': df['Age'].median(),
    'Embarked': df['Embarked'].mode()[0],
    'Fare': df['Fare'].median(),
    'Sex': df['Sex'].mode()[0],
    'Survived': df['Survived'].mode()[0],
    'Cabin': "Unknown",
    'Pclass': df['Pclass'].mode()[0],
    'Parch': df['Parch'].mode()[0],
    'SibSp': df['SibSp'].mode()[0],
}, inplace=True)

#### 3. Dealing with Outliers
##### Here we calculate the upper limit with the formula: upper_limit = mean + 3 * standard_deviation
##### After calculating the upper limit, we replace every value greater than the limit with the limit itself (capping).

In [None]:
# Cap Age at mean + 3 standard deviations
age_upper_limit = df['Age'].mean() + 3 * df['Age'].std()
df.loc[df['Age'] > age_upper_limit, 'Age'] = age_upper_limit

# Cap Fare at mean + 3 standard deviations
fare_upper_limit = df['Fare'].mean() + 3 * df['Fare'].std()
df.loc[df['Fare'] > fare_upper_limit, 'Fare'] = fare_upper_limit
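
To see the capping rule at work on numbers that are easy to check, here is a small worked sketch. The `fares` Series is made up, and `Series.clip` is used as an equivalent way to apply the cap without modifying the original values:

```python
import pandas as pd

# Hypothetical data: twenty ordinary fares and one extreme value.
fares = pd.Series([10.0] * 20 + [1000.0], name="Fare")

# Same rule as above: mean + 3 * standard deviation
# (pandas uses the sample standard deviation, ddof=1, by default).
upper_limit = fares.mean() + 3 * fares.std()

# How many values exceed the limit before capping.
n_capped = int((fares > upper_limit).sum())

# clip() returns a copy in which values above the limit are set to the limit.
capped = fares.clip(upper=upper_limit)

print(f"upper limit: {upper_limit:.2f}")
print(f"values capped: {n_capped}")
print(f"max before: {fares.max():.2f}, max after: {capped.max():.2f}")
```

Note that the mean and standard deviation are themselves pulled upward by extreme values, so this rule is conservative on heavily skewed columns.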

#### 4. Standardizing data: Here we strip surrounding whitespace from the values in the Sex column and convert them to lowercase so that they are consistent.

In [None]:
# Vectorized string methods: remove surrounding whitespace, then lowercase
df['Sex'] = df['Sex'].str.strip().str.lower()

#### 5. Removal of irrelevant data
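
No columns are removed in this notebook. As an illustration only, the sketch below shows how columns could be dropped if they were judged irrelevant. The choice of `PassengerId`, `Name`, and `Ticket` is an assumption made for the example, and the `toy` DataFrame is made up:

```python
import pandas as pd

# Hypothetical toy data; the column names are assumed for illustration.
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Passenger A", "Passenger B"],
    "Ticket": ["T1", "T2"],
    "Age": [22.0, 38.0],
    "Survived": [0, 1],
})

# Assumption: these identifier-like columns are not needed for the analysis.
irrelevant_columns = ["PassengerId", "Name", "Ticket"]

# errors="ignore" skips any listed column that is not present.
trimmed = toy.drop(columns=irrelevant_columns, errors="ignore")

print(trimmed.columns.tolist())
```

Which columns count as irrelevant depends on the question being asked, so the list should be reviewed for each analysis.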

#### 6. Validating Data

In [None]:
assert df['Pclass'].isin([1, 2, 3]).all(), "Pclass contains invalid values"
assert df['Survived'].isin([0, 1]).all(), "Survived contains invalid values"
assert df['Sex'].isin(['male', 'female']).all(), "Sex contains invalid values"
assert df['Embarked'].isin(['C', 'Q', 'S']).all(), "Embarked contains invalid values"
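
An `assert` stops at the first failing rule. When it is more useful to see all problems at once, the rules can be collected and the violations counted, as in this rough sketch. The `toy` data is made up, and the 0 to 120 age range is an assumed plausibility bound, not something taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data with one invalid Pclass and one invalid Age.
toy = pd.DataFrame({
    "Pclass":   [1, 2, 4],
    "Survived": [0, 1, 1],
    "Age":      [22.0, -5.0, 40.0],
})

# Each rule is a boolean Series: True where the row satisfies the rule.
rules = {
    "Pclass in {1, 2, 3}": toy["Pclass"].isin([1, 2, 3]),
    "Survived in {0, 1}": toy["Survived"].isin([0, 1]),
    "Age between 0 and 120": toy["Age"].between(0, 120),  # assumed range
}

# Count the rows that break each rule.
violations = {name: int((~ok).sum()) for name, ok in rules.items()}

print(violations)
```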

#### 7. Fixing Inaccurate Data: This dataset has no such entries, so nothing needs to be fixed here.

In [None]:
df.to_csv('Titanic_Cleaned_train.csv', index=False)


In [None]:
df.sample(10)