## 1- **Importing Data from KaggleHub**

After downloading the dataset with KaggleHub, the files are stored locally in the folder that `path` points to.  
Kaggle datasets often contain one or more `.csv` files, so we load each file into a pandas DataFrame and then concatenate them into one large DataFrame. This only makes sense when the files share the same columns.

```python
import glob
import os

import pandas as pd

# `path` is the local dataset folder obtained from the KaggleHub download step
csv_files = glob.glob(os.path.join(path, "*.csv"))  # List all CSV files in the dataset folder
dfs = [pd.read_csv(f) for f in csv_files]           # Read each CSV file into a pandas DataFrame
df_combined = pd.concat(dfs, ignore_index=True)     # Stack them into one DataFrame and reset the index
print(df_combined.shape)                            # (rows, columns)
print(df_combined.head())                           # First 5 rows of the dataset
```

If the files come from different sources, we can add the name of each source (here, the file name without its extension) as a new column, so we can keep track of where each row came from.

```python
# Tag every row with its source file name (without the .csv extension) in a new `brand` column
dfs = [pd.read_csv(file).assign(brand=os.path.splitext(os.path.basename(file))[0]) for file in csv_files]
df_combined = pd.concat(dfs, ignore_index=True)
```

## 2- **Data Cleaning and Preprocessing**


Before we start modifying the data (dropping columns, transforming values, etc.), we first need to understand the dataset. That is the purpose of Exploratory Data Analysis (EDA).

- **What is EDA?** It's a process of analyzing the dataset to summarize its main characteristics, often using visual methods. The goal of EDA is to understand the data better, identify patterns, spot anomalies, test hypotheses, and check assumptions.
- **Why is EDA important?** Because it helps us to make informed decisions about how to clean and preprocess the data, which features to use for modeling, and which algorithms to apply. It also helps us to identify potential problems with the data, such as missing values, outliers, and multicollinearity.
- **How to perform EDA?** There are many techniques and tools available for EDA, but some common ones include:
  - Descriptive statistics: mean, median, mode, standard deviation, etc.
  - Data visualization: histograms, box plots, scatter plots, heatmaps, etc.
  - Correlation analysis: Pearson correlation, Spearman correlation, etc.
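
A first pass over these techniques can be sketched in a few lines of pandas. The small DataFrame below is made up for illustration (the values and the `df_demo` name are assumptions, not part of the real dataset); in practice the same calls would be run on `df_combined`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df_demo = pd.DataFrame({
    "mileage": [10000, 25000, 40000, 60000, 80000],
    "price": [20000, 17500, 15000, 12000, 9000],
    "fuelType": ["Petrol", "Diesel", "Petrol", "Hybrid", "Diesel"],
})

# Descriptive statistics for the numeric columns (count, mean, std, quartiles, ...)
summary = df_demo.describe()
print(summary)

# Number of missing values per column
missing = df_demo.isnull().sum()
print(missing)

# Frequency of each category in a categorical column
print(df_demo["fuelType"].value_counts())

# Pearson and Spearman correlation between the numeric columns
pearson = df_demo.corr(numeric_only=True, method="pearson")
spearman = df_demo.corr(numeric_only=True, method="spearman")
print(pearson)
print(spearman)
```

In this toy data the price falls steadily as mileage rises, so both correlations are close to -1.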

For more information, please refer to this [EDA overview](https://www.geeksforgeeks.org/data-analysis/what-is-exploratory-data-analysis/).

![mlconcepts_image7](/S1-Intro_to_ML/images/EDA.png)

### 1- **Dropping Unnecessary Columns**
After performing EDA, we can drop any columns that are not needed for our analysis or modeling. This can help to reduce the size of the dataset and improve the performance of our models.

```python
# drop() returns a new DataFrame, so assign the result back; columns= already targets columns, so axis is not needed
df = df.drop(columns=['unnecessary_column_1', 'unnecessary_column_2'])
```

### 2- **Handling Missing Values**
If we have missing values in our dataset (NA, NaN, None, etc.), many models will fail or behave unexpectedly, so we need to handle them before training.
Missing values can be handled in several ways, depending on the nature of the data and the extent of the missingness.<br>
Common techniques include removing the rows or columns that contain missing values, imputing them with the mean, median, or mode, or using more advanced techniques such as KNN imputation or regression imputation.

- **What is imputation?** Imputation is the process of replacing missing values in a dataset with estimated values.
- **Why is imputation important?** Because many machine learning algorithms cannot handle missing values, and removing rows or columns with missing values can lead to loss of valuable information and reduced dataset size.
- **How to perform imputation?** There are several techniques for imputation, including:
  - Simple imputation: Replacing missing values with the mean, median, mode, or a constant value.
  - KNN imputation: Using the k-nearest neighbors algorithm to estimate missing values based on the values of similar rows.
  - Regression imputation: Using regression models to predict missing values based on other features in the dataset.
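
As a minimal sketch of KNN imputation, the example below uses scikit-learn's `KNNImputer` on a made-up table (the `df_demo` values are assumptions for illustration). With `n_neighbors=1`, the missing engine size is copied from the row with the closest mileage.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with one missing engine size
df_demo = pd.DataFrame({
    "mileage": [10000.0, 12000.0, 50000.0, 52000.0],
    "engineSize": [1.0, np.nan, 2.0, 2.2],
})

# Each missing value is replaced by the mean of that feature over the
# k nearest rows; distances use only the features both rows have
knn_imputer = KNNImputer(n_neighbors=1)
imputed = pd.DataFrame(knn_imputer.fit_transform(df_demo), columns=df_demo.columns)
print(imputed)
```

Because KNN relies on distances, features on very different scales should usually be scaled first. For regression-style imputation, scikit-learn offers `IterativeImputer`, which is still marked experimental and must be enabled with `from sklearn.experimental import enable_iterative_imputer`.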


#### **Handling Missing Values with SimpleImputer**
- **Numeric Data**: 
  - **Mean** → average of all values <br>
  Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ <br>
  Useful when data is normally distributed (no big outliers). 

  - **Median** → middle value when data is sorted <br>
  Example: `[1, 3, 4, 9, 12] → median = 4`<br>
  Better than mean when data has **outliers** (skewed distribution).

- **Categorical Data**: 
  - **Mode** → most frequent value <br>
  Example: `[red, blue, blue, green] → mode = blue`<br>
  Works well when only a few values are missing, since the most frequent label is a reasonable guess and the category distribution barely changes.

  - **Constant value** → replace with a fixed label (e.g., `"Unknown"`, `"N/A"`) <br>
  This is useful when many values are missing, or when the fact that a value is missing may itself be informative → instead of forcing them into an existing category, we keep them separate.

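```python
from sklearn.impute import SimpleImputer

print(df_combined.isnull().sum())  # Count missing values per column

# Split the columns by dtype so each group gets a suitable strategy
numeric_cols = df_combined.select_dtypes(include='number').columns
categorical_cols = df_combined.select_dtypes(include='object').columns

# For numeric columns: the median is robust to outliers
num_imputer = SimpleImputer(strategy='median')
df_combined[numeric_cols] = num_imputer.fit_transform(df_combined[numeric_cols])

# For categorical columns: fill with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df_combined[categorical_cols] = cat_imputer.fit_transform(df_combined[categorical_cols])
```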
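
To see why the median is the safer choice when a column has outliers, here is a small sketch with made-up prices (the numbers are assumptions for illustration). A single extreme value drags the mean far away from the typical price, while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value and one missing entry
prices = pd.Series([10000, 11000, 12000, 13000, 500000, np.nan])

mean_value = prices.mean()      # pulled upward by the 500000 outlier
median_value = prices.median()  # middle of the sorted non-missing values

print(mean_value)    # 109200.0
print(median_value)  # 12000.0

# Filling the gap with each statistic gives very different results
filled_with_mean = prices.fillna(mean_value)
filled_with_median = prices.fillna(median_value)
```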

### 3- **Encoding Categorical Variables**
Many machine learning algorithms require numerical input, so we need to convert categorical variables into numerical format. There are several techniques for encoding categorical variables, including:
- **One-Hot Encoding**: This technique creates a new binary column for each category in the original column. For example, if we have a column "Color" with categories "Red", "Blue", and "Green", one-hot encoding will create three new columns: "Color_Red", "Color_Blue", and "Color_Green". Each row will have a value of 1 in the column corresponding to its category and 0 in the other columns.
- **Label Encoding**: This technique assigns a unique integer to each category in the original column. For example, if we have a column "Color" with categories "Red", "Blue", and "Green", label encoding could map them to 0, 1, and 2. Because the integers imply an order, this technique is best suited to categories with an ordinal relationship (e.g., "Low", "Medium", "High"), and the mapping should follow that order.
- **Target Encoding**: This technique replaces each category in the original column with the mean of the target variable for that category. For example, if we have a column "Color" and a target variable "Price", target encoding will replace each category in the "Color" column with the mean price for that color. This technique can be useful when there is a strong relationship between the categorical variable and the target variable.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["model", "transmission", "fuelType"]
# drop='first' avoids the dummy variable trap; sparse_output=False returns a dense array (scikit-learn >= 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_data = encoder.fit_transform(df_combined[categorical_cols])
# Reuse the original index so the rows line up when concatenating
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols), index=df_combined.index)
df_combined = pd.concat([df_combined, encoded_df], axis=1)
df_combined = df_combined.drop(columns=categorical_cols)  # The encoded columns replace the originals
```
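
The block above covers one-hot encoding. The other two techniques can be sketched as follows, using a made-up table (the `condition` and `color` columns and their values are assumptions for illustration). `OrdinalEncoder` is used here instead of `LabelEncoder` because it lets us state the category order explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data
df_demo = pd.DataFrame({
    "condition": ["Low", "High", "Medium", "Low"],
    "color": ["Red", "Blue", "Red", "Blue"],
    "price": [5000, 9000, 7000, 6000],
})

# Ordinal encoding with an explicit order, so that Low < Medium < High
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df_demo["condition_encoded"] = ordinal.fit_transform(df_demo[["condition"]]).ravel()

# Target encoding: replace each color with the mean price for that color
color_means = df_demo.groupby("color")["price"].mean()
df_demo["color_encoded"] = df_demo["color"].map(color_means)

print(df_demo)
```

In a real project the target means should be computed on the training set only and then mapped onto the validation and test sets; otherwise the target leaks into the features.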

![mlconcepts_image8](/S1-Intro_to_ML/images/OHE.png)


### 4- **Removing Outliers**
Outliers are data points that are significantly different from the rest of the data. They can be caused by measurement errors, data entry errors, or natural variability in the data. Outliers can have a significant impact on the performance of machine learning models, so it is important to identify and remove them before training our models.
- **How to identify outliers?** There are several techniques for identifying outliers, including:
  - Visual inspection: Plotting the data using box plots, scatter plots, or histograms can help to identify outliers visually.
  - Statistical methods: Using statistical measures such as Z-scores or the IQR (Interquartile Range) can help to identify outliers mathematically.


Here are common **statistical methods** for handling outliers:
#### 1. Z-Score Method (Standard Deviation Method)

**Formula:** 
$$ Z = \frac{x - \mu}{\sigma} $$
Where:
- $x$ : data point  
- $\mu$ : mean of the dataset  
- $\sigma$ : standard deviation  

**Rule of thumb:** if $|Z| > 3$, the point is an outlier.  
**Best for:** **Normally distributed numerical data**.  

#### 2. IQR Method (Interquartile Range)
**Formulas:**
$$ IQR = Q_3 - Q_1 $$
$$ \text{Lower Bound} = Q_1 - 1.5 \times IQR $$
$$ \text{Upper Bound} = Q_3 + 1.5 \times IQR $$

- $Q_1$: 25th percentile  
- $Q_3$: 75th percentile  

Any point outside $[\text{Lower Bound}, \text{Upper Bound}]$ is considered an outlier.   
**Best for:** **Skewed numerical data**, robust to non-normal distributions.  

#### 3. Modified Z-Score (using Median and MAD)
**Formula:**
$$ M_i = \frac{0.6745 \, (x_i - \text{Median})}{MAD} $$
Where: $$ MAD = \text{Median} \big( |x_i - \text{Median}| \big) $$

**Threshold:** if $|M_i| > 3.5$, the point is an outlier.  
**Best for:** **Numerical data with heavy skewness or outliers** (more robust than mean/std).  
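
The modified Z-score can be sketched the same way. The values below are made up for illustration; the constant 0.6745 and the 3.5 threshold come from the formula above.

```python
import numpy as np

# Hypothetical data with one extreme value
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

median = np.median(values)                # 12.0
mad = np.median(np.abs(values - median))  # median absolute deviation = 1.0

# Modified Z-score of every point
modified_z = 0.6745 * (values - median) / mad

outlier_mask = np.abs(modified_z) > 3.5
print(values[outlier_mask])  # [100.]
```

If more than half of the values are identical, the MAD is 0 and the division fails, so that case needs separate handling.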

#### 4. Domain / Business Rules
Sometimes, outliers are defined by **real-world knowledge**.  
Example: age cannot be negative, or engine size cannot be > 10L in a car dataset.  
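
Such rules translate directly into boolean filters. The sketch below uses made-up rows, and both thresholds (engine size at most 10 L, year no later than 2025) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with two implausible rows
df_demo = pd.DataFrame({
    "year": [2015, 2019, 2060, 2012],
    "engineSize": [1.6, 2.0, 1.4, 12.5],
})

# Assumed business rules: engine size in (0, 10] litres, year not in the future
valid_engine = (df_demo["engineSize"] > 0) & (df_demo["engineSize"] <= 10)
valid_year = df_demo["year"] <= 2025

df_clean = df_demo[valid_engine & valid_year]
print(df_clean)
```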


```python
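def remove_outliers(df, col):
    """Keep only the rows whose value in `col` lies within the IQR bounds."""
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# Skip binary (one-hot encoded) columns: their IQR is often 0, which would discard every row of the rarer category
numerical_cols = [
    col for col in df_combined.select_dtypes(include=['float64', 'int64']).columns
    if df_combined[col].nunique() > 2
]
for col in numerical_cols:
    df_combined = remove_outliers(df_combined, col)  # Bounds are recomputed on the already filtered data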
```




![mlconcepts_image9](/S1-Intro_to_ML/images/IQR.webp)

### 5- **Feature Engineering**
Feature engineering is the process of creating, modifying, or transforming variables (features) in a dataset to help machine learning models capture patterns more effectively.

**How do we do Feature Engineering?**  
- **Domain Knowledge:** Use common sense or subject knowledge to create meaningful features.  
- **Combining Features:** Derive new ones from existing ones.  
- **Encoding Categorical Data:** Convert categories into numbers.  
- **Transformations:** Apply mathematical changes (log, square root) to reduce skewness.  
- **Extracting Information:** Break down dates, text, or images into useful variables.  
- **Evaluation:** Use correlation, feature importance, or model performance to check if the new features are useful.
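
A few of these ideas can be sketched on a made-up car table. All column names and values below are assumptions for illustration, including the `sale_date` column, which the real dataset may not have.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    "year": [2015, 2018, 2020],
    "mileage": [60000, 30000, 0],
    "sale_date": pd.to_datetime(["2021-03-15", "2021-07-01", "2021-11-20"]),
})

# Extracting information: break the date into simpler parts
df_demo["sale_year"] = df_demo["sale_date"].dt.year
df_demo["sale_month"] = df_demo["sale_date"].dt.month

# Domain knowledge: the age at sale is often more meaningful than the model year
df_demo["car_age"] = df_demo["sale_year"] - df_demo["year"]

# Combining features: mileage per year (age clipped at 1 to avoid dividing by zero)
df_demo["mileage_per_year"] = df_demo["mileage"] / df_demo["car_age"].clip(lower=1)

# Transformation: log1p computes log(1 + x), which reduces skewness and handles zeros
df_demo["log_mileage"] = np.log1p(df_demo["mileage"])

print(df_demo)
```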

**Why is Feature Engineering Important?**  
- **Improve accuracy**: Choosing the right features helps the model learn better, leading to more accurate predictions  
- **Reduce overfitting**: Using fewer, more important features helps the model avoid memorizing the data and perform better on new data.  
- **Boost interpretability**: Well-chosen features make it easier to understand how the model makes its predictions.  
- **Enhance efficiency**: Focusing on key features speeds up the model’s training and prediction process, saving time and resources.

**What is Correlation?**

Correlation measures the **strength and direction** of a linear relationship between two variables.
The **Pearson correlation coefficient** is defined as:
$$
\rho_{X,Y} = \frac{\text{cov}(X, Y)}{\sigma_X \, \sigma_Y}
$$

where  
- $\text{cov}(X, Y)$ is the covariance between variables $X$ and $Y$.
- $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively.  
- The correlation coefficient $\rho_{X,Y}$ ranges from -1 to +1:
  - +1 indicates a perfect positive linear relationship (as one variable increases, the other also increases).
  - -1 indicates a perfect negative linear relationship (as one variable increases, the other decreases).
  - 0 indicates no linear relationship between the variables.
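
To make the formula concrete, the sketch below computes the coefficient by hand on two short made-up arrays and compares it with NumPy's built-in `np.corrcoef`.

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance: mean of the products of the deviations from each mean
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson coefficient: covariance divided by the product of the standard deviations
rho = cov_xy / (x.std() * y.std())

# Same value from NumPy's built-in function
rho_numpy = np.corrcoef(x, y)[0, 1]

print(round(rho, 4), round(rho_numpy, 4))  # 0.7746 0.7746
```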

When two features are **highly correlated** (positively or negatively), they provide almost the **same information** to the model.  
Keeping both adds little value and can lead to **multicollinearity**, which makes the coefficients of linear models unstable and harder to interpret.

```python
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
correlation_matrix = df_combined.corr(numeric_only=True)  # Correlate numeric columns only (non-numeric columns would raise an error)
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
```
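
Once the heatmap shows strongly correlated pairs, one simple way to act on it is to drop one feature from each pair. The sketch below uses made-up columns, and the 0.9 threshold is an assumed cut-off, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical features: mileage_km is just mileage_miles rescaled
df_demo = pd.DataFrame({
    "mileage_miles": [10000.0, 20000.0, 30000.0, 40000.0, 50000.0],
    "engineSize": [1.0, 2.0, 1.4, 1.0, 2.0],
})
df_demo["mileage_km"] = df_demo["mileage_miles"] * 1.609

# Absolute correlations, so strong negative pairs are caught as well
corr = df_demo.corr().abs()

# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df_demo.drop(columns=to_drop)

print(to_drop)  # ['mileage_km']
```

This keeps the first feature of each correlated pair and drops the later one; which of the two to keep is a judgment call that domain knowledge can inform.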

For more information, please refer to this article on [Feature Engineering](https://www.geeksforgeeks.org/machine-learning/what-is-feature-engineering/).


![mlconcepts_image10](/S1-Intro_to_ML/images/FE.png)

### 6- **Splitting the Dataset (Train-Test Split)**
Before training our machine learning models, we need to split our dataset into training, validation, and test sets. The validation set is used to tune and compare models, while the test set lets us evaluate the final model on unseen data.

```python
from sklearn.model_selection import train_test_split

X = df_combined.drop('price', axis=1)  # Features
y = df_combined['price']               # Target
print(y.mean())
print(y.median())  # Quick sanity check on the target distribution

# First hold out 40% of the data, then split that half-and-half into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)  # 60% train
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)  # 20% val, 20% test
# random_state can be any integer; it seeds the random number generator so the split is reproducible
```


### 7- **Feature Scaling**
Feature scaling is a technique used to standardize the range of independent variables or features of data. In machine learning, it is important because many algorithms are sensitive to the scale of the input features. If the features are on different scales, the algorithm may give more weight to the features with larger values, leading to biased results.

There are several methods for feature scaling:

1. **Min-Max Scaling**: This technique scales the data to a fixed range, usually 0 to 1. The formula is:
   $$
   X' = \frac{X - X_{min}}{X_{max} - X_{min}}
   $$

2. **Standardization (Z-score Normalization)**: This method scales the data to have a mean of 0 and a standard deviation of 1. The formula is:
   $$
   X' = \frac{X - \mu}{\sigma}
   $$
   where $\mu$ is the mean and $\sigma$ is the standard deviation.

3. **Robust Scaling**: This technique uses the median and the interquartile range for scaling, making it robust to outliers. The formula is:
   $$
   X' = \frac{X - X_{median}}{X_{IQR}}
   $$

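```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Learn the mean and standard deviation from the training set only
X_val_scaled = scaler.transform(X_val)          # Reuse the training statistics to avoid data leakage
X_test_scaled = scaler.transform(X_test)
```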
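
To compare the three formulas on the same data, the sketch below applies scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` to one made-up column that contains an outlier (the values are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one outlier (100)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

min_max = MinMaxScaler().fit_transform(X)    # (X - min) / (max - min)
standard = StandardScaler().fit_transform(X) # (X - mean) / std
robust = RobustScaler().fit_transform(X)     # (X - median) / IQR

print(min_max.ravel())   # the outlier squeezes the other values close to 0
print(standard.ravel())
print(robust.ravel())    # [-1.  -0.5  0.   0.5 48.5]
```

With Min-Max scaling the four ordinary values all end up below 0.04, while robust scaling keeps them evenly spread around 0 and leaves the outlier clearly visible.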