# Data analysis from zero to hero

## Imports section

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import sklearn
import os

## First step: load your data set

To analyze a data set you first have to load it correctly.

Depending on the format of the data you are going to analyze, pandas offers several readers that return a DataFrame.
For instance:
- CSV: pd.read_csv("file/path/to/data.csv", sep=";")
- Excel: pd.read_excel("file/path/to/data.xlsx")
- JSON: pd.read_json("file/path/to/data.json")
- HTML: pd.read_html("file/path/to/data.html") (returns a list of DataFrames, one per table found in the page)
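
As a minimal sketch of why the `sep` argument matters, the cell below reads a small, made-up CSV from memory (the `city` and `price` columns are invented for illustration) once with the right separator and once with the default one.

```python
import io
import pandas as pd

# Hypothetical CSV content that uses ";" as the field separator
text = "city;price\nRome;250000\nMilan;320000\n"

# Correct separator: two columns are parsed
sample_df = pd.read_csv(io.StringIO(text), sep=";")

# Default separator (","): every row collapses into a single column
wrong_df = pd.read_csv(io.StringIO(text))

print(sample_df.shape)  # (2, 2)
print(wrong_df.shape)   # (2, 1)
```

If the shape right after loading looks suspicious (for example a single column), the separator is the first thing worth checking.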

In [None]:
path = os.path.join("sample_data", "housing.csv")
df = pd.read_csv(path)

## Exploratory Data Analysis

First of all, look around your data in order to understand the basic features of the data frame.

The advised steps are the following:
- Visualize the first 5 and the last 5 rows using df.head() and df.tail() respectively
- Have a look at the shape of the data set, using df.shape
- Visualize the available columns using df.columns
- Check data types using df.dtypes

For a brief look at the statistics and null values:
- df.describe() and df.describe().T
- df.info()
- df.isna().sum() for the null count by column, df.isna().sum().sum() for the overall count
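
The difference between the two null counts can be sketched on a toy frame (the `toy` DataFrame and its columns are made up for this example):

```python
import numpy as np
import pandas as pd

# Toy frame with three missing values in total
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

per_column = toy.isna().sum()   # one count per column: a=1, b=2, c=0
total = toy.isna().sum().sum()  # single overall count: 3
share = toy.isna().mean()       # fraction of nulls per column

print(per_column)
print(total)
```

The per-column fraction is often more telling than the raw count, since it shows at a glance which columns are mostly empty.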

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
df.describe().T

In [None]:
df.info()

## Manage null values

Columns often contain null values.

Depending on their importance in your analysis, you can either drop the affected rows or fill the nulls with a suitable value (e.g. the column mean).

Note: perform this kind of operation on a copy of the original data set; better safe than sorry.

Note 2: using inplace=True on some operations triggers a FutureWarning, so it is better to avoid it in favour of assignment. Assignment is also clearer, more readable, and makes the intent explicit.

In [None]:
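# Drop the rows where the column is null (assigning a dropna() result back to the column would realign on the index and keep the NaNs)
# df = df.dropna(subset=['column_with_null_values'])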
# df['column_with_null_values'] = df['column_with_null_values'].fillna(df['column_with_null_values'].mean())
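
A minimal sketch of both strategies, applied to a copy so the original stays untouched. The `original` frame and its `rooms` and `price` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in "rooms"
original = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100, 150, 200, 180],
})

# Work on a copy, as advised above
cleaned = original.copy()

# Option 1: drop the rows where "rooms" is null
dropped = cleaned.dropna(subset=["rooms"])

# Option 2: fill the nulls with the column mean (assignment, no inplace=True)
filled = cleaned.assign(rooms=cleaned["rooms"].fillna(cleaned["rooms"].mean()))

print(len(dropped))              # 3
print(filled["rooms"].tolist())  # [3.0, 4.0, 5.0, 4.0]
```

Which option fits depends on how many rows are affected and on how much the filled value would distort later statistics.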

## Filtering data

You may want to visualize or process only a slice of the original pandas Series (column).
This can be done with boolean masks: their syntax is simple, and they are very useful.

In [None]:
# boolean_mask = series_name > value
# boolean_mask_and = (series_name > value) & (series_name < value)
# boolean_mask_or = (series_name < value) | (series_name == value)
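
As a runnable sketch of the masks above: pandas combines element-wise conditions with `&` (and), `|` (or) and `~` (not), and each comparison needs its own parentheses. The `prices` Series is made up for the example.

```python
import pandas as pd

# Hypothetical price column
prices = pd.Series([120, 80, 200, 150, 95], name="price")

above_100 = prices > 100                          # single condition
in_range = (prices > 90) & (prices < 160)         # both conditions must hold
cheap_or_exact = (prices < 90) | (prices == 200)  # at least one condition holds
not_in_range = ~in_range                          # negation of a mask

# Indexing with a mask keeps only the rows where it is True
print(prices[in_range].tolist())        # [120, 150, 95]
print(prices[cheap_or_exact].tolist())  # [80, 200]
```

The same masks work on a DataFrame, e.g. `df[mask]` or `df.loc[mask, 'column']`.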

Another useful feature is column renaming, which helps clarify what a series contains. More trivially, you may want to capitalize or otherwise format the names.

In [None]:
# df.columns = df.columns.map(lambda col: col.capitalize())
# df.columns = df.columns.map(str.capitalize)
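
A short sketch of both kinds of renaming, on a made-up frame: `rename` changes selected columns, while the `.str` accessor on `columns` reformats all names at once.

```python
import pandas as pd

# Hypothetical frame with snake_case column names
frame = pd.DataFrame({"median_income": [1, 2], "total_rooms": [3, 4]})

# Rename a single column to something clearer
frame = frame.rename(columns={"median_income": "income"})

# Reformat every name: underscores to spaces, then title case
frame.columns = frame.columns.str.replace("_", " ").str.title()

print(list(frame.columns))  # ['Income', 'Total Rooms']
```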

## Plotting

Visualize the data you are working with in order to reveal features that summary statistics hide

In [None]:
df.hist(figsize=(10, 10), bins=50)

### Simple plot

In [None]:
x = list(range(0, 21))
y = [i**2 for i in x]

plt.figure(figsize=(10,6))
plt.plot(x, y)
plt.title('Square of X')
plt.xlabel('X')
plt.ylabel('Y')
plt.xticks(x)
plt.show()

### Scatter

The c argument sets the color of the points; passing c=x colors each point according to its x value

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=x)

### Bar Chart

In [None]:
labels = ['MS', 'Apple', 'Meta', 'Google', 'Amazon']
stock_prices = [130, 112, 145, 180, 201]

plt.figure(figsize=(10, 4))
bars = plt.bar(labels, stock_prices)
bars[2].set_hatch('.')
bars[0].set_width(0.2)
bars[1].set_color('red')

### Heat map

In [None]:
plt.figure(figsize=(10, 10))
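# numeric_only=True skips non-numeric columns, which would otherwise raise an error in pandas 2.x
sb.heatmap(df.corr(numeric_only=True), annot=True)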

## Data Wrangling

This step deals with outliers, so that a few extreme values do not distort the rest of the analysis

A common procedure consists of replacing the outliers with the mean value
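
One possible sketch of this procedure is shown below. The text does not say how outliers are detected, so the 1.5 × IQR rule used here is an assumption, as is the choice of computing the mean from the non-outlier values only. The `values` Series is made up.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Assumed detection rule: anything outside 1.5 * IQR from the quartiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Replace the outliers with the mean of the remaining values
inlier_mean = values[~is_outlier].mean()
smoothed = values.mask(is_outlier, inlier_mean)

print(smoothed.tolist())  # [10.0, 12.0, 11.0, 13.0, 12.0, 11.6]
```

Other rules (z-scores, domain thresholds) and other replacements (median, clipping) are equally common; the right one depends on the data.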

## Aggregations and grouping

As with an SQL GROUP BY, you may need to group data by a column in order to analyze it by category, for example to calculate summary statistics such as the mean, sum, or count for each group

In [None]:
# grouped_df = df.groupby('COL_NAME').agg({'COL_1': fun1, 'COL_2': fun2})
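
A runnable version of the pattern above, using named aggregation. The `sales` frame and its columns are invented for the example.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "units": [10, 5, 7, 8, 3],
    "price": [2.0, 3.0, 2.5, 3.5, 2.0],
})

# One row per region, one output column per (input column, function) pair
summary = sales.groupby("region").agg(
    total_units=("units", "sum"),
    mean_price=("price", "mean"),
    orders=("units", "count"),
)

print(summary)
```

Named aggregation keeps the output column names explicit, which is handy when several statistics come from the same input column.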

### Concat, Merge and Join

These are three methods for combining DataFrames, each with different use cases

#### Concat

Concatenates DataFrames along an axis (row-wise or column-wise).

**Use case**: When you have similarly structured data you want to append (like monthly reports into yearly data)

In [None]:
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Stack vertically (add rows)
result = pd.concat([df1, df2], axis=0)

# Stack horizontally (add columns)
result = pd.concat([df1, df2], axis=1)

#### Merge

Combines DataFrames based on common columns/keys (like SQL JOIN).

**Warning**: If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.

**Use case**: When you need to combine data based on matching values (like joining customer info with orders)

In [None]:
df1 = pd.DataFrame({'key': ['A', 'B'], 'value1': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'C'], 'value2': [3, 4]})

# Inner join (default)
# Left/right/outer/cross joins are also available
result = pd.merge(df1, df2, on='key', how='inner')
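
The warning about null keys can be illustrated with a small sketch. The `left` and `right` frames are made up; dropping the null keys before merging is one possible workaround, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical frames whose key columns both contain a null
left = pd.DataFrame({"key": [1.0, np.nan], "left_val": ["a", "b"]})
right = pd.DataFrame({"key": [np.nan, 2.0], "right_val": ["x", "y"]})

# The two NaN keys are matched against each other
matched = pd.merge(left, right, on="key", how="inner")
print(len(matched))  # 1

# Possible workaround: drop the null keys on both sides first
safe = pd.merge(
    left.dropna(subset=["key"]),
    right.dropna(subset=["key"]),
    on="key",
    how="inner",
)
print(len(safe))  # 0
```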

#### Join

Like merge, but joins on indices by default (convenient shortcut)

**Use case**: When DataFrames have meaningful indices you want to join on

In [None]:
df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['x', 'z'])

# Join on index (left join by default)
result = df1.join(df2, how='left')

## Learning models

When using sklearn in production/data science projects, follow this standard workflow:

### 1. Data Preparation and Exploration

Follow the steps shown in the previous sections of this notebook

In [None]:
# Load and explore data
import pandas as pd

df = pd.read_csv('data.csv')

# EDA: check shape, info, describe, missing values, distributions
X = df.drop('target', axis=1)
y = df['target']

### 2. Train-Test Split (Always First!)

In [None]:
from sklearn.model_selection import train_test_split

# Split BEFORE any preprocessing to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### 3. Preprocessing Pipeline 

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit on training data only, transform both 
scaler = StandardScaler() 
X_train_scaled = scaler.fit_transform(X_train) 
X_test_scaled = scaler.transform(X_test) # Don't fit on test! 

### 4. Model Training and Cross-Validation 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression()

# Validate on the training set using CV
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5) 
print(f"CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})") 

# Train final model 
model.fit(X_train_scaled, y_train)

### 5. Hyperparameter Tuning 

In [None]:
from sklearn.model_selection import GridSearchCV

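# The 'saga' solver supports both 'l1' and 'l2'; the default 'lbfgs' solver does not support 'l1'
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['saga']}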
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy') 
grid_search.fit(X_train_scaled, y_train) 
best_model = grid_search.best_estimator_ 

### 6. Evaluation on Test Set (Only Once!)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = best_model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

### 7. Use Pipelines for Production 

In [None]:
from sklearn.pipeline import Pipeline

# Complete pipeline (preprocessing + model) 
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipeline.fit(X_train, y_train) # Pipeline handles everything
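
To see the whole workflow end to end without a data file, here is a self-contained sketch on synthetic data. The data set comes from `make_classification` and all variable names are chosen for this example; `max_iter=1000` is only a precaution against convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem (stand-in for real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Split first, before any preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inside each CV fold the scaler is refit on that fold's training part only
cv_scores_demo = cross_val_score(demo_pipeline, X_tr, y_tr, cv=5)
print(f"CV accuracy: {cv_scores_demo.mean():.3f}")

# Final fit on the full training set, single evaluation on the test set
demo_pipeline.fit(X_tr, y_tr)
test_accuracy = demo_pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Passing the pipeline (rather than pre-scaled arrays) to `cross_val_score` is what keeps the validation folds out of the scaler's statistics.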

### Key principles
- Split data first, then preprocess
- Fit preprocessing only on training data
- Use cross-validation on training set for model selection
- Test set touched only once for final evaluation
- Use Pipelines to ensure consistent transformations
- Set random_state for reproducibility
- Never fit scaler/encoder on entire dataset before splitting

### Main sklearn modules overview

#### 1. preprocessing 
- Purpose: Transform and normalize data before training 
- Key functions: StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
- Use cases:
    - Scaling features to similar ranges (0-1 or mean=0, std=1)
    - Encoding categorical variables
    - Handling missing values (the imputers, e.g. SimpleImputer, live in the separate sklearn.impute module)
    - Feature normalization for algorithms sensitive to feature scales (SVM, neural networks)
 
#### 2. linear_model 
- Purpose: Linear regression and classification models
- Key models: LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet 
- Use cases:
    - Predicting continuous values (house prices, sales)
    - Binary/multi-class classification
    - Feature selection with regularization
    - When interpretability is important

#### 3. naive_bayes 
- Purpose: Probabilistic classification based on Bayes' theorem 
- Key models: GaussianNB, MultinomialNB, BernoulliNB 
- Use cases:   
    - Text classification (spam detection, sentiment analysis)
    - Document categorization
    - Real-time prediction (fast training)
    - When features are (approximately) conditionally independent given the class
 
#### 4. model_selection
- Purpose: Model evaluation and hyperparameter tuning 
- Key functions: train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV 
- Use cases:
    - Splitting data into train/test sets
    - Cross-validation to estimate generalization performance and detect overfitting
    - Finding optimal hyperparameters
    - Model performance comparison
 
#### 5. metrics
- Purpose: Evaluate model performance
- Key functions: accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, mean_squared_error
- Use cases:
    - Classification metrics (accuracy, precision, recall)
    - Regression metrics (MSE, MAE, R²)
    - ROC curves for threshold selection
    - Model comparison
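
As a small worked example of the classification metrics, the sketch below scores four made-up predictions by hand and with sklearn.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: one positive sample is missed by the model
y_true_demo = [0, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)    # 3 of 4 correct -> 0.75
prec = precision_score(y_true_demo, y_pred_demo)  # 1 TP / (1 TP + 0 FP) -> 1.0
rec = recall_score(y_true_demo, y_pred_demo)      # 1 TP / (1 TP + 1 FN) -> 0.5
f1 = f1_score(y_true_demo, y_pred_demo)           # harmonic mean of the two

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(acc, prec, rec, round(f1, 3))
print(cm)
```

Accuracy alone hides the missed positive here, which is why precision and recall are usually reported together.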
 
#### 6. ensemble 
- Purpose: Combine multiple models for better performance 
- Key models: RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier 
- Use cases:
    - Reducing overfitting (Random Forest)
    - Improving prediction accuracy
    - Handling complex non-linear relationships

### RandomForestClassifier overview

The typical procedure for using RandomForestClassifier from scikit-learn:

1. Import the classifier
2. Prepare your data
3. Create and configure the model
4. Train the model
5. Make predictions
6. Evaluate the model
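
The six steps can be sketched as follows. The iris data set bundled with scikit-learn is used here only as a stand-in, and the parameter values are illustrative defaults rather than recommendations.

```python
# 1. Import the classifier (plus helpers for data and evaluation)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 2. Prepare the data
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_tr_rf, X_te_rf, y_tr_rf, y_te_rf = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# 3. Create and configure the model
rf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)

# 4. Train the model
rf.fit(X_tr_rf, y_tr_rf)

# 5. Make predictions
rf_pred = rf.predict(X_te_rf)

# 6. Evaluate the model
rf_accuracy = accuracy_score(y_te_rf, rf_pred)
print(f"Accuracy: {rf_accuracy:.3f}")

# Relative importance of each feature (the values sum to 1)
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

No scaling step appears here because tree-based models are not sensitive to feature scales.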

**Key parameters to tune:**
- `n_estimators`: Number of trees in the forest (default: 100)
- `max_depth`: Maximum depth of each tree
- `min_samples_split`: Minimum samples required to split a node
- `max_features`: Number of features to consider for best split
- `class_weight`: Handle imbalanced datasets

#### Tips and tricks for selecting features and target

**Target Variable (y)**
- Must be the column you want to predict
- Should be categorical for classification (e.g., 'species', 'diagnosis', 'category')
- Should be a single column

**Features (X)**

These are the input variables used to predict y.

Selection criteria:
1. Exclude the target variable
2. Remove irrelevant columns:
    - ID columns (e.g., 'customer_id', 'transaction_id')
    - Timestamps (unless using as features after processing)
    - Columns with constant values
3. Consider correlation:
    - Include features that correlate with the target
    - Remove one feature from each pair of highly correlated features (multicollinearity)
4. Handle data types:
    - Numerical features can be used directly
    - Categorical features need encoding (one-hot, label encoding)

**Best practice:** Start with domain knowledge to identify relevant features, then refine using feature importance or selection techniques.
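
A minimal sketch of the selection criteria on a made-up customer table (all column names and values are hypothetical): the target is separated, the ID column and constant columns are dropped, and the remaining categorical column is one-hot encoded.

```python
import pandas as pd

# Hypothetical raw data
raw = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],   # identifier, carries no signal
    "country": ["IT", "FR", "IT", "DE"],   # categorical feature
    "age": [34, 45, 29, 52],               # numerical feature
    "source": ["web", "web", "web", "web"],  # constant column
    "churned": [0, 1, 0, 1],               # target
})

target_col = "churned"
y_feat = raw[target_col]

# Exclude the target and the ID column
X_feat = raw.drop(columns=[target_col, "customer_id"])

# Drop columns that hold a single value
constant_cols = [c for c in X_feat.columns if X_feat[c].nunique(dropna=False) <= 1]
X_feat = X_feat.drop(columns=constant_cols)

# One-hot encode the categorical column
X_feat = pd.get_dummies(X_feat, columns=["country"])

print(constant_cols)         # ['source']
print(list(X_feat.columns))  # ['age', 'country_DE', 'country_FR', 'country_IT']
```

In a real project the encoding step would normally live inside a Pipeline (e.g. with OneHotEncoder) so that it is fit on the training data only.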