In [None]:
import pandas as pd
import numpy as np
import json
import requests
import plotly.express as px
import matplotlib.pyplot as plt
import datetime
import time
from bs4 import BeautifulSoup
from bs4 import BeautifulStoneSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import tempfile
import seaborn as sns

In [None]:
train=pd.read_csv('train.csv')
train.head()

In [None]:
test=pd.read_csv('test.csv')
test.head()

### Column Types

- **Numerical** - Age, Fare, PassengerId
- **Categorical** - Survived, Pclass, Sex, SibSp, Parch, Embarked
- **Mixed** - Name, Ticket, Cabin
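
A grouping like the one above usually comes from looking at both the stored dtype and the number of distinct values per column. The sketch below illustrates that check on a small hypothetical frame (`toy` and all of its values are made up for illustration and are not rows from `train.csv`).

```python
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Name": ["A", "B", "C", "D", "E", "F"],
})

# The stored dtype alone is not enough: Pclass is stored as an integer
# but behaves like a category, so the distinct-value count is shown as well.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "n_missing": toy.isna().sum(),
})
print(summary)
```

A column with few distinct values relative to the number of rows (such as `Pclass` here) is a candidate for the categorical group even when it is stored as a number.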

### Univariate Analysis 

Univariate analysis focuses on analyzing each feature in the dataset independently.

- **Distribution analysis**: The distribution of each feature is examined to identify its shape, central tendency, and dispersion.

- **Identifying potential issues**: Univariate analysis helps identify potential problems with the data, such as outliers, skewness, and missing values.

#### The shape of a data distribution refers to its overall pattern or form as it is represented on a graph. Some common shapes of data distributions include:

- **Normal Distribution**: A symmetrical and bell-shaped distribution where the mean, median, and mode are equal and the majority of the data falls in the middle of the distribution with gradually decreasing frequencies towards the tails.

- **Skewed Distribution**: A distribution that is not symmetrical, with one tail being longer than the other. It can be either positively skewed (right-skewed) or negatively skewed (left-skewed).

- **Bimodal Distribution**: A distribution with two peaks or modes.

- **Uniform Distribution**: A distribution where all values have an equal chance of occurring.

The shape of the data distribution is important in identifying the presence of outliers, skewness, and the type of statistical tests and models that can be used for further analysis.
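
As a rough numeric check of shape, the sketch below compares mean, median, and sample skewness for three synthetic samples. The data is randomly generated with a fixed seed (an assumption made purely for illustration); it is not the Titanic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Three synthetic samples with different shapes (illustrative only).
samples = pd.DataFrame({
    "normal": rng.normal(loc=0.0, scale=1.0, size=10_000),
    "right_skewed": rng.exponential(scale=1.0, size=10_000),
    "uniform": rng.uniform(low=0.0, high=1.0, size=10_000),
})

# Skewness near 0 suggests symmetry; a positive value suggests a longer right tail.
shape = pd.DataFrame({
    "mean": samples.mean(),
    "median": samples.median(),
    "skew": samples.skew(),
})
print(shape.round(2))
```

For the right-skewed sample the mean sits above the median, which is the usual signature of a long right tail. Skewness alone does not separate a normal from a uniform sample, since both are symmetric, so a histogram or KDE plot is still needed.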

#### **Dispersion** is a statistical term used to describe the spread or variability of a set of data. It measures how far the values in a data set are spread out from the central tendency (mean, median, or mode) of the data.

There are several measures of dispersion, including:

- **Range**: The difference between the largest and smallest values in a data set.

- **Variance**: The average of the squared deviations of each value from the mean of the data set.

- **Standard Deviation**: The square root of the variance. It provides a measure of the spread of the data that is in the same units as the original data.

- **Interquartile range (IQR)**: The range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data.

Dispersion helps to describe the spread of the data, which can help to identify the presence of outliers and skewness in the data.
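
The four measures can be computed directly with pandas. The sketch below uses a short made-up series; note that `ddof=0` is passed so that the variance matches the "average of squared deviations" definition given above, whereas pandas defaults to the sample variance (`ddof=1`).

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up numbers with mean 5

value_range = values.max() - values.min()   # largest minus smallest value
variance = values.var(ddof=0)               # population variance: mean of squared deviations
std_dev = values.std(ddof=0)                # square root of the variance
q1, q3 = values.quantile([0.25, 0.75])      # first and third quartiles
iqr = q3 - q1                               # interquartile range

print(value_range, variance, std_dev, iqr)  # 7 4.0 2.0 1.5
```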

### Steps of doing Univariate Analysis on Numerical columns

- **Descriptive Statistics**: Compute basic summary statistics for the column, such as mean, median, mode, standard deviation, range, and quartiles. These statistics give a general understanding of the distribution of the data and can help identify skewness or outliers.

- **Visualizations**: Create visualizations to explore the distribution of the data. Some common visualizations for numerical data include histograms, box plots, and density plots. These visualizations provide a visual representation of the distribution of the data and can help identify skewness and outliers.

- **Identifying Outliers**: Identify and examine any outliers in the data. Outliers can be identified using visualizations such as box plots. It is important to determine whether the outliers are due to measurement errors, data entry errors, or legitimate differences in the data, and to decide whether to include or exclude them from the analysis.

- **Skewness**: Check for skewness in the data and consider transforming the data or using robust statistical methods that are less sensitive to skewness, if necessary.

- **Conclusion**: Summarize the findings of the EDA and make decisions about how to proceed with further analysis.
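
The non-visual parts of these steps can be sketched as follows. The series `age` is a hypothetical stand-in for a numerical column (its values are made up), and the 1.5 × IQR rule used to flag outliers is one common convention assumed here, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a numerical column such as Age (values are made up).
age = pd.Series([22, 38, 26, 35, np.nan, 54, 2, 27, 14, 80, np.nan, 31], name="Age")

print(age.describe())                   # count, mean, std, min, quartiles, max
missing_pct = age.isna().mean() * 100   # share of missing values in percent
skewness = age.skew()                   # missing values are skipped by default

# 1.5 * IQR rule (assumption: one common convention for flagging outliers).
q1, q3 = age.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = age[(age < lower) | (age > upper)]

print(f"missing: {missing_pct:.1f}%, skewness: {skewness:.2f}")
print(outliers)
```

Flagged values still need to be inspected by hand before anything is dropped, since an extreme but legitimate observation is not an error.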


### Age

**Conclusions**

- Age is approximately normally distributed
- About 20% of the values are missing
- There are some outliers

### Steps of doing Univariate Analysis on Categorical columns

- **Descriptive Statistics**: Compute the frequency distribution of the categories in the column. This will give a general understanding of the distribution of the categories and their relative frequencies.

- **Visualizations**: Create visualizations to explore the distribution of the categories. Some common visualizations for categorical data include count plots and pie charts. These visualizations provide a visual representation of the distribution of the categories and can help identify any patterns or anomalies in the data.

- **Missing Values**: Check for missing values in the data and decide how to handle them. Missing values can be imputed or excluded from the analysis, depending on the research question and the data set.

- **Conclusion**: Summarize the findings of the EDA and make decisions about how to proceed with further analysis.
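
A minimal sketch of the frequency and missing-value steps is shown below. The series `embarked` is a hypothetical stand-in for a categorical column (its values are made up), and filling with the most frequent category is only one possible imputation choice.

```python
import pandas as pd

# Hypothetical stand-in for a categorical column such as Embarked (values are made up).
embarked = pd.Series(["S", "C", "S", "Q", "S", None, "C", "S"], name="Embarked")

counts = embarked.value_counts()                       # absolute frequencies, missing excluded
shares = embarked.value_counts(normalize=True) * 100   # relative frequencies in percent
n_missing = embarked.isna().sum()                      # number of missing entries

# One possible way to handle missing values: fill with the most frequent category.
filled = embarked.fillna(embarked.mode()[0])

print(counts)
print(shares.round(1))
print("missing:", n_missing)
```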

### Survived

**Conclusions**

- The `Parch` and `SibSp` columns can be merged to form a new column called `family_size`
- Create a new column called `is_alone`
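
One way these two features could be built is sketched below on a hypothetical frame (`toy` and its values are made up). Adding 1 so that the passenger is counted as part of their own family is an assumption, not something stated above.

```python
import pandas as pd

# Hypothetical rows with the two columns involved (values are made up).
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Assumption: the passenger counts as a family member too, hence the "+ 1".
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1

# A passenger is alone when the family consists only of themselves.
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

print(toy)
```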

### Steps of doing Bivariate Analysis

- Select two columns
- Understand the type of relationship
    1. **Numerical - Numerical**<br>
        a. You can plot graphs like scatter plots (regression plots), 2D histograms, and 2D KDE plots<br>
        b. Check the correlation coefficient to assess the linear relationship
    2. **Numerical - Categorical** - create visualizations that compare the distribution of the numerical data across different categories of the categorical data.<br>
        a. You can plot graphs like bar plots, box plots, KDE plots, violin plots, and even scatter plots<br>
    3. **Categorical - Categorical**<br>
        a. You can create cross-tabulations or contingency tables that show the distribution of values in one categorical column, grouped by the values in the other categorical column.<br>
        b. You can plot graphs like heatmaps, stacked bar plots, and treemaps
        
- Write your conclusions
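
The numeric side of each of the three cases can be sketched as follows. The frame `toy` is hypothetical (all values are made up), and Pearson correlation is what `Series.corr` computes by default.

```python
import pandas as pd

# Hypothetical rows mimicking a few Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
    "Survived": [0, 1, 1, 0, 0, 1, 1, 1],
})

# Numerical - Numerical: Pearson correlation coefficient (linear relationship only).
corr = toy["Age"].corr(toy["Fare"])

# Numerical - Categorical: summarize the numerical column within each category.
age_by_survived = toy.groupby("Survived")["Age"].agg(["mean", "median", "count"])

# Categorical - Categorical: contingency table, as percentages within each column.
ct = pd.crosstab(toy["Survived"], toy["Sex"], normalize="columns") * 100

print(round(corr, 2))
print(age_by_survived)
print(ct.round(1))
```

With `normalize="columns"`, each column of the table sums to 100, so the values read as "percentage of this sex that did or did not survive".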

In [None]:
train['Age'].plot(kind='kde')

In [None]:
# plt.scatter(train.index, train['Age'])
# plt.xlabel('Index')
# plt.ylabel('Age')
# plt.title('Scatter Plot of Age')
# plt.show()


In [None]:
train['Age'].plot(kind='hist')

In [None]:
df=train

In [None]:
train['Age'].plot(kind='box')

In [None]:
pd.crosstab(df['Survived'],df['Pclass'],normalize='columns')*100

In [None]:
sns.heatmap(pd.crosstab(df['Survived'],df['Pclass'],normalize='columns')*100)

In [None]:
ct = pd.crosstab(df['Survived'], df['Pclass'], normalize='columns') * 100

# Plot with Plotly
fig = px.imshow(ct,
                text_auto=True,
                color_continuous_scale='Blues',
                labels=dict(x="Pclass", y="Survived", color="Percentage"))

fig.update_layout(title="Survival Rate by Passenger Class (%)",
                  xaxis_title="Passenger Class",
                  yaxis_title="Survived",
                  yaxis=dict(tickmode='array', tickvals=[0, 1]))

fig.show()

In [None]:
pd.crosstab(df['Survived'],df['Sex'],normalize='columns')*100

In [None]:
sns.heatmap(pd.crosstab(df['Survived'],df['Sex'],normalize='columns')*100)

In [None]:
# Crosstab with normalization
ct = pd.crosstab(df['Survived'], df['Sex'], normalize='columns') * 100

# Plot using Plotly
fig = px.imshow(ct,
                text_auto=True,
                color_continuous_scale='Blues',
                labels=dict(x="Sex", y="Survived", color="Percentage"))

fig.update_layout(title="Survival Rate by Sex (%)",
                  xaxis_title="Sex",
                  yaxis_title="Survived",
                  yaxis=dict(tickmode='array', tickvals=[0, 1]))

fig.show()

In [None]:
plt.figure(figsize=(20,14))
df[df['Survived']==1]['Age'].plot(kind='kde',label='Survived')
df[df['Survived']==0]['Age'].plot(kind='kde',label='Did not survive')

plt.legend()
plt.show()

In [None]:
import plotly.figure_factory as ff

ct1 = df[df['Survived'] == 1]['Age'].dropna()
ct2 = df[df['Survived'] == 0]['Age'].dropna()

# Create KDE plots
fig = ff.create_distplot(
    [ct1, ct2],               # list of data series
    group_labels=['Survived', 'Did not survive'],  # labels for legend
    show_hist=False,          # only KDE, no histogram
    show_rug=False            # optional: remove rug plot
)

fig.update_layout(title_text='KDE Plot of Age - Survived vs Did Not Survive')
fig.show()
