**Workshop Title: Intermediate Python Workshop: EDA, Statistical Testing and Introduction to Machine Learning**

**Duration: 3 hours**

**Objective**: To provide intermediate Python programmers with practical knowledge on Exploratory Data Analysis (EDA), statistical testing, and a brief introduction to machine learning.

**Prerequisites**: Basic knowledge of Python programming and basic understanding of statistics.

---

### I. Introduction (15 minutes)

- Brief Introduction to the Workshop Topics
- Importance and Applications of EDA, Statistical Testing and Machine Learning

---

### II. Exploratory Data Analysis (EDA) with Python (45 minutes)

**Key Concepts**: Pandas, NumPy, Matplotlib, Seaborn

- Introduction to EDA
- Python Libraries for EDA: Brief Overview
- Data Cleaning:
  - Handling Missing Data
  - Removing Duplicates
  - Data Type Conversion
- Data Visualization:
  - Histograms
  - Box plots
  - Scatter plots
- Descriptive Statistics: Mean, Median, Mode, Skewness, Kurtosis

---

### III. Statistical Testing with Python (45 minutes)

**Key Concepts**: SciPy, ANOVA, T-Test

- Introduction to Statistical Testing
- Hypothesis Testing Overview
- T-Test:
  - One-sample T-test
  - Two-sample T-test
- ANOVA:
  - One-way ANOVA
  - Two-way ANOVA
- Interpretation of Results

---

### IV. Introduction to Machine Learning with Python (1 hour)

**Key Concepts**: Scikit-learn, Supervised Learning, Unsupervised Learning

- Introduction to Machine Learning
- Types of Machine Learning: Supervised, Unsupervised, and Reinforcement Learning
- Supervised Learning:
  - Linear Regression: Brief Overview and Python Implementation
  - Classification: Brief Overview and Python Implementation (e.g., Logistic Regression)
- Unsupervised Learning:
  - Clustering: Brief Overview and Python Implementation (e.g., K-means)
- Introduction to Scikit-Learn Library
- Data Splitting: Training Set and Test Set
- Model Training and Prediction
- Model Evaluation

--- 

### V. Q&A and Closing Remarks (15 minutes)

- Attendee Questions
- Summary of the Workshop
- Further Learning Resources

# Section II: EDA

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the data

In [None]:
# Load the dataset
df = sns.load_dataset('titanic')

# Display the first few rows
print(df.head())


## Basic information

In [None]:
# Get some basic information about the data
print(df.info())

In [None]:
# Get the numerical data description
print(df.describe())

In [None]:
# Get the description of the non-numeric (object, category, bool) columns
print(df.describe(exclude=[np.number]))

## Data Cleaning

### Missing Values

In [None]:
# Check for missing values
print(df.isnull().sum())

#### Numerical data

In [None]:
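# Fill missing age data with the median age
# (assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())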

#### Categorical data

In [None]:
# Fill missing embarked data with the mode (most frequent value)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])


#### Too many missing?

In [None]:
# Drop the 'deck' column as it has too many missing values
df = df.drop(['deck'], axis=1)

### Duplicates

In [None]:
# Removing duplicates
df = df.drop_duplicates()

### Outlier detection

In [None]:
# This is just an example
# Outlier Detection with IQR for 'age'
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1

# Defining the acceptable range
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

In [None]:
# Identifying outliers
outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
print(outliers)

In [None]:
# Removing outliers
df_out = df[(df['age'] >= lower_bound) & (df['age'] <= upper_bound)]
# Note: this does not modify df; it creates a new DataFrame called df_out

In [None]:
# Comparing shapes of the original and cleaned dataframes
print(df.shape)
print(df_out.shape)
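The IQR rule used above can be wrapped in a small reusable helper. This is only a sketch: `iqr_bounds` is a hypothetical name, and the ages below are made up for illustration rather than taken from the Titanic data.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (made up for illustration): one obviously extreme value
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 95])
lower, upper = iqr_bounds(ages)

# Flag values outside the acceptable range
is_outlier = (ages < lower) | (ages > upper)
print(lower, upper)
print(ages[is_outlier].tolist())
```

The multiplier `k = 1.5` is the conventional default; a larger value flags fewer points.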

## Convert data type

In [None]:
# Converting 'survived' from int to bool
df['survived'] = df['survived'].astype(bool)
print(df.info())

## Replace with mapping

In [None]:
map_sex = {'female': 1, 'male': 0}
df['sex'] = df['sex'].map(map_sex)
print(df.head())

## Data Visualization

In [None]:
# Checking the updated info
print(df.info())

# Checking the descriptive statistics
print(df.describe())

### Histograms

In [None]:
# Histograms for numerical columns
df.hist(bins=30, figsize=(10,10))
plt.tight_layout()
plt.show()

### Box Plot

In [None]:
# Box plot of age across different classes
sns.boxplot(x='class', y='age', data=df)
plt.show()

### Bar Plot

In [None]:
# Bar plot of survival by class
sns.barplot(x='class', y='survived', data=df)
plt.show()


### Pairplot

In [None]:
# Pairplot to visualize the relationships between variables
sns.pairplot(df, hue='survived')
plt.show()

### Correlation matrix

In [None]:
# Correlation matrix
plt.figure(figsize=(10,10))
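# numeric_only=True skips text and categorical columns, which would otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')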
plt.show()

# Section III: Statistical Testing with Python (45 minutes)

In [None]:
from scipy import stats

## One-sample t-test

The one-sample T-test is used when we want to compare a sample mean with a hypothesized population mean. It tells us whether the difference between the two is larger than we would expect from sampling variation alone. The scenario in the code checks whether the mean age of passengers on the Titanic is 30.

Null hypothesis (H0): The mean age of all passengers on the Titanic is 30.

Alternative hypothesis (H1): The mean age of all passengers on the Titanic is not 30.

In [None]:
# One-sample T-test
age = df_out['age']
print('Sample mean age:', age.mean())
tstat, pval = stats.ttest_1samp(age, 30)  # H0: the population mean age is 30
print('One-sample T-test p-value', pval)
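To make the test less of a black box, the t-statistic and p-value can be recomputed by hand and compared with SciPy. The sketch below uses synthetic ages drawn from a normal distribution (an assumption for illustration, not the Titanic data).

```python
import numpy as np
from scipy import stats

# Synthetic ages (made up for illustration, not the Titanic data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=29.0, scale=12.0, size=200)
mu0 = 30.0  # hypothesized population mean

# t = (sample mean - mu0) / (sample std / sqrt(n)), with ddof=1 for the sample std
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy should give the same numbers
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

A common convention is to reject H0 when the p-value is below a significance level chosen in advance, such as 0.05.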


## Two-sample T-test

The two-sample T-test is used when we want to compare the means of two independent samples. It tells us whether the two underlying populations plausibly share the same mean. In the code, we're comparing the mean age of passengers in first class and third class to see if they're the same.



Null hypothesis (H0): The mean age of passengers in first class is equal to the mean age of passengers in third class.

Alternative hypothesis (H1): The mean age of passengers in first class is not equal to the mean age of passengers in third class.

In [None]:
age_class_1 = df_out[df_out['class'] == 'First']['age']
age_class_3 = df_out[df_out['class'] == 'Third']['age']
ttest, pval = stats.ttest_ind(age_class_1, age_class_3)
print('Two-sample T-test p-value', pval)
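By default, `stats.ttest_ind` assumes that both groups have the same variance. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic groups with different spreads and sizes (made-up data, not the Titanic classes).

```python
import numpy as np
from scipy import stats

# Two synthetic groups with different means, variances, and sizes
rng = np.random.default_rng(1)
group_a = rng.normal(loc=38.0, scale=14.0, size=150)
group_b = rng.normal(loc=25.0, scale=10.0, size=300)

# Student's t-test: assumes equal variances in both groups
t_student, p_student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print('Student:', t_student, p_student)
print('Welch:  ', t_welch, p_welch)
```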


## Paired T-test

The paired T-test is used when we want to compare the means of the same group at two different times. For example, it could be used to compare a person's weight before and after a certain treatment. In the code, we compare the ages of the first 50 passengers at two simulated time points (the "after" values are the "before" values shifted by about one year, purely for illustration).

Null hypothesis (H0): The mean age of the first 50 passengers at time1 (before) is equal to their mean age at time2 (after).

Alternative hypothesis (H1): The mean age of the first 50 passengers at time1 (before) is not equal to their mean age at time2 (after).

In [None]:
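# Simulated "after" ages: +1 year plus small seeded noise. A constant shift
# alone gives zero-variance differences and a degenerate t-statistic.
rng = np.random.default_rng(0)
age_before = df_out['age'].iloc[:50]
age_after = age_before + 1 + rng.normal(0, 0.5, size=len(age_before))
paired_ttest, pval = stats.ttest_rel(age_before, age_after)
print('Paired T-test p-value', pval)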

## One-way ANOVA
The one-way analysis of variance (ANOVA) is used when we want to compare the means of three or more groups. It tells us whether at least one group mean differs significantly from the others, but not which one. In the code, we're comparing the mean ages of passengers in first, second, and third classes.

Null hypothesis (H0): The mean ages of passengers in first, second, and third classes are all equal.

Alternative hypothesis (H1): At least one class has a different mean age compared to the others.



In [None]:
# One-way ANOVA on age across the three passenger classes
fstat, pval = stats.f_oneway(df_out[df_out['class'] == 'First']['age'],
                             df_out[df_out['class'] == 'Second']['age'], 
                             df_out[df_out['class'] == 'Third']['age'])
print('One-way ANOVA p-value', pval)
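The F-statistic behind one-way ANOVA is the ratio of between-group variability to within-group variability. As a rough sketch, it can be computed by hand on three synthetic groups (made-up data) and checked against `stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Three synthetic groups (made up for illustration)
rng = np.random.default_rng(2)
groups = [
    rng.normal(38, 14, 100),
    rng.normal(30, 13, 120),
    rng.normal(25, 12, 200),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
n = all_values.size  # total number of observations

# Sum of squares between groups and within groups
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, p_manual)
print(f_scipy, p_scipy)
```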

## Chi-square Test for Independence

The chi-square test for independence is used when we want to see if there is a relationship between two categorical variables. In other words, it tests whether the distribution of one categorical variable differs across the levels of the other. In the code, we're checking if the 'survived' variable is related to the 'sex' variable.

Null hypothesis (H0): The 'survived' and 'sex' variables are independent, i.e., survival rate is the same for males and females.

Alternative hypothesis (H1): The 'survived' and 'sex' variables are not independent, i.e., survival rate is not the same for males and females.


In [None]:
# Here we check if 'survived' is related to 'sex'
contingency_table = pd.crosstab(df_out['survived'], df_out['sex'])
chi2, pval, dof, expected = stats.chi2_contingency(contingency_table)
print('Chi-square p-value', pval)
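The test compares the observed counts with the counts we would expect if the two variables were independent. The sketch below works through that calculation on a made-up 2x2 table (the counts and labels are illustrative assumptions) and checks it against SciPy.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up counts for illustration (rows: outcome, columns: group)
observed = pd.DataFrame(
    [[30, 70], [60, 40]],
    index=['no', 'yes'],
    columns=['group_a', 'group_b'],
)

# Expected count under independence: row total * column total / grand total
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
grand_total = observed.to_numpy().sum()
expected_manual = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed.to_numpy() - expected_manual) ** 2 / expected_manual).sum()

# correction=False turns off the Yates continuity correction that SciPy
# applies to 2x2 tables by default, so the result matches the manual formula
chi2, pval, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2_manual, chi2, pval, dof)
```

With the default `correction=True`, SciPy reports a slightly smaller statistic for 2x2 tables, so the two values would not match exactly.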


# Section IV: Introduction to Machine Learning with Python

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

Define X and y:

- We'll predict 'survived' based on 'pclass', 'sex', 'age', 'sibsp', 'parch'

- This is a supervised learning problem (binary classification), because the target label is known for every passenger

## Data preparation

Preparing the dataset by selecting relevant features and the target variable. In this case, we're trying to predict 'survived' based on 'pclass', 'sex', 'age', 'sibsp', 'parch'.

In [None]:
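# df_out was created before 'sex' was mapped to 0/1 in df, so encode it here
X = df_out[['pclass', 'sex', 'age', 'sibsp', 'parch']].copy()
X['sex'] = X['sex'].map({'female': 1, 'male': 0})
y = df_out['survived']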

Splitting the data into training and test sets. This allows us to train our models on one set of data and then test them on unseen data.

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
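When the classes are imbalanced, a purely random split can leave the training and test sets with different class proportions. One option is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data with roughly 30% positives (an assumption for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels (made up for illustration)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.3).astype(int)  # roughly 30% positives

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print('Positive rate overall:', y_demo.mean())
print('Positive rate train:  ', y_tr.mean())
print('Positive rate test:   ', y_te.mean())
```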

Standardizing the features. Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This includes algorithms that use a weighted sum of the input, like logistic regression, and algorithms that use distance measures, like k-nearest neighbors.

In [None]:
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
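The scaler is fitted on the training set only, so that no information from the test set leaks into preprocessing. One way to make this harder to get wrong is to bundle the scaler and the model into a scikit-learn pipeline. This is a sketch on synthetic data (made up for illustration), not the Titanic features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales; the label depends on the second one
rng = np.random.default_rng(4)
X_demo = rng.normal(loc=[0, 50, 1000], scale=[1, 10, 200], size=(500, 3))
y_demo = (X_demo[:, 1] + rng.normal(0, 5, 500) > 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling parameters from the training data only;
# score() and predict() reuse them on new data
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print('Test accuracy:', test_accuracy)
```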

Training and evaluating three different types of models: Logistic Regression, Random Forest, and Gradient Boosting. Each model is trained on the training data and then used to make predictions on the test data. The models' performance is evaluated using the classification report, which provides key metrics such as precision, recall, and f1-score.

## Logistic Regression

In [None]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print("Logistic Regression:")
print(classification_report(y_test, y_pred))

## Random Forest

In [None]:
# Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Random Forest:")
print(classification_report(y_test, y_pred))


## Gradient Boosting

In [None]:
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
print("Gradient Boosting:")
print(classification_report(y_test, y_pred))
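The three models above are each scored on a single train/test split, so the comparison depends on how that split happened to fall. A possible complement is k-fold cross-validation with `cross_val_score`, sketched here on synthetic data (made up for illustration).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the label depends on the first two features plus noise
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_means[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Setting `random_state` on the tree-based models makes the results repeatable from run to run.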

## Classification Report

Let's assume we have the following confusion matrix for a binary classification problem:

|   | Predicted: No  | Predicted: Yes |
|---|----------------|----------------|
| Actual: No  | TN = 50  | FP = 10 |
| Actual: Yes | FN = 5   | TP = 100|

### Accuracy

**Accuracy** is the ratio of correctly predicted observations (TP + TN) to the total observations (TP + TN + FP + FN).

Accuracy = (TP + TN) / (TP + TN + FP + FN)

= (100 + 50) / (100 + 50 + 10 + 5) = 0.91

### Precision

**Precision** is the ratio of correctly predicted positive (TP) observations to the total predicted positives (TP + FP). High precision means that few of the predicted positives are false positives. It is defined as:

Precision = True Positives / (True Positives + False Positives)

= 100 / (100 + 10) = 0.91

### Recall (Sensitivity or True Positive Rate)

**Recall** is the ratio of correctly predicted positive (TP) observations to all actual positives (TP + FN). It is defined as:

Recall = True Positives / (True Positives + False Negatives)

= 100 / (100 + 5) = 0.95

### F1 Score

**F1 Score** is the harmonic mean of Precision and Recall. It therefore takes both false positives and false negatives into account. It's defined as:

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

= 2 * (0.95 * 0.91) / (0.95 + 0.91) = 0.93
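The four metrics can be reproduced in a few lines from the confusion matrix counts above. As a cross-check, the sketch also rebuilds label vectors with the same counts and compares the results with scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts from the confusion matrix above
tn, fp, fn, tp = 50, 10, 5, 100

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.91 0.91 0.95 0.93

# Rebuild label vectors with the same counts to cross-check with scikit-learn
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
```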