# **Hands-on Machine Learning: Build While You Learn**

Week 2: Shaping Data for ML: Clean, Transform, and Prepare

Especially for you:

GitHub Repo:

In this repo you will find:
1. Week 1 Python code
2. Week 1 learning resources
3. Small practice assignments, which you will submit as contributions to the repo
4. A video on how to contribute to the project

By the end of week 4, you will not only have covered the basics of ML but will also have your first PR merged and your first issues resolved in our repo. You will learn Git and GitHub along the way.

# **What is seaborn?**


*   Seaborn is a Python library built on top of Matplotlib, designed for beautiful, easy-to-use visualizations.
*   It is more high-level than Matplotlib, so common statistical plots take fewer lines of code.
*   It helps you spot patterns, trends, and correlations in data.






# **Why Data Preparation?**

If the data is messy, the results will be messy too, no matter how good the algorithm is.

# **How Do We Find Datasets?**

Most people use Kaggle to find datasets.

[Kaggle Link](https://www.kaggle.com/)

In [1]:
# Week 2: Shaping Data for ML – Cleaning, Transforming, Preparing

# Step 1: Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

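# Step 2: Load Dataset
# The CSV must be in the notebook's working directory; otherwise
# pandas raises the FileNotFoundError shown in the output below.
df = pd.read_csv("messy_songs_dataset.csv")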

# Preview
print("First 5 rows:\n", df.head())
print("\nDataset Info:\n")
df.info()
print("\nSummary:\n", df.describe(include='all'))


FileNotFoundError: [Errno 2] No such file or directory: 'messy_songs_dataset.csv'

In [None]:
# Step 3: Identify Issues
print("\nMissing Values:\n", df.isnull().sum())
print("\nDuplicate Rows:", df.duplicated().sum())

# Example: Check unique categories in a column
print("\nUnique Languages:\n", df['Language'].unique())


# **What is Data Cleaning?**

1. Handling missing values (removing or filling them)
2. Removing duplicates
3. Correcting inconsistencies
4. Making sure numbers are actually numbers
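
The ideas in this list can be tried on a tiny in-memory table before touching the real file. This is only a sketch: the column names and values below are made up for illustration and are not taken from `messy_songs_dataset.csv`.

```python
import pandas as pd

# Hypothetical toy data (not from the course dataset)
raw = pd.DataFrame({
    "Title":    ["Song A", "Song B", "Song B", "Song C", "Song D"],
    "Language": [" Hindi", "english ", "english ", "ENGLISH", None],
    "Rating":   ["4.5*", "3.8", "3.8", "five", "4.0"],
})

clean = raw.copy()

# Correct inconsistencies: trim spaces and unify casing
clean["Language"] = clean["Language"].str.strip().str.lower()

# Make sure numbers are actually numbers: strip stray symbols,
# then turn anything still unparseable (like "five") into NaN
clean["Rating"] = pd.to_numeric(
    clean["Rating"].str.replace(r"[^0-9.]", "", regex=True),
    errors="coerce",
)

# Handle missing values: here we simply drop rows that have any
clean = clean.dropna()

# Remove duplicate rows
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Dropping rows is the simplest way to handle missing values; filling them (for example with the column mean) keeps more rows and is what the next cell does for `Streams`.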

In [None]:
# Step 4: Cleaning Data

# Fix category casing
df['Language'] = df['Language'].str.strip().str.lower()

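# Handle missing values: make Streams numeric first (bad entries become NaN),
# then fill the gaps with the column mean
df['Streams'] = pd.to_numeric(df['Streams'], errors='coerce')
df['Streams'] = df['Streams'].fillna(df['Streams'].mean())

# Clean messy ratings: strip stray characters such as '*', then convert to float
# (anything that still cannot be parsed becomes NaN)
df['Rating'] = pd.to_numeric(
    df['Rating'].astype(str).str.replace(r'[^0-9.]', '', regex=True),
    errors='coerce'
)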

# Drop duplicates
df = df.drop_duplicates()

# After cleaning
print("\nCleaned Data Info:\n")
df.info()


# **What is Data Transformation?**

Data transformation means changing data into a form an ML model can work with.

1. Scaling: Bringing numeric columns onto a comparable range, much like converting heights recorded in cm and inches into one standard unit, so that a column with large values does not overpower one with small values.
2. Encoding: Converting categories (like Hindi, English, Japanese) into numbers so an ML model can read them.
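
Here is a minimal sketch of both ideas on made-up data. It assumes `MinMaxScaler` for scaling and one-hot encoding via `pd.get_dummies` for encoding; these are just one reasonable choice each, and other scalers and encoders exist.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (not from the course dataset)
songs = pd.DataFrame({
    "Streams":  [1_000, 50_000, 1_000_000],
    "Language": ["hindi", "english", "japanese"],
})

# Scaling: squeeze Streams into the 0-1 range
scaler = MinMaxScaler()
songs["Streams_scaled"] = scaler.fit_transform(songs[["Streams"]]).ravel()

# Encoding: one 0/1 column per language (one-hot encoding)
encoded = pd.get_dummies(songs, columns=["Language"], dtype=int)

print(encoded)
```

After this, the smallest stream count maps to 0, the largest to 1, and each language gets its own column such as `Language_hindi`.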

In [None]:
# Step 5: Visualizing the Cleaned Data

# Example: Distribution of Streams
sns.histplot(df['Streams'], kde=True, color="blue")
plt.title("Distribution of Streams (Cleaned)")
plt.show()

# Example: Countplot of Languages
sns.countplot(x='Language', data=df)
plt.title("Songs by Language")
plt.xticks(rotation=45)
plt.show()


# **What is Scikit-learn (sklearn)?**


*   Scikit-learn (imported as `sklearn`) is one of the most popular Python libraries for machine learning.
*   It provides simple tools for data preprocessing (cleaning, encoding, scaling), splitting data into training and testing sets, ML algorithms (linear regression, decision trees, KNN, SVM, etc.), and model evaluation (accuracy, confusion matrix, cross-validation).



# **What is Data Preparation?**

1. Splitting data into training and test sets
2. Feature selection: choosing the right “ingredients” that matter most.
3. Making sure everything is tidy, consistent, and ready for the ML model.
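
The first two steps can be sketched as follows. The data is randomly generated for illustration, and `SelectKBest` with `f_regression` is just one of several feature selection tools in scikit-learn, picked here as an assumption rather than the course's prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical toy data: "Rating" drives the target, "Noise" does not
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "Rating": rng.uniform(1, 5, size=100),
    "Noise":  rng.normal(size=100),
})
y = 1000 * X["Rating"] + rng.normal(scale=50, size=100)

# Split: hold back 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature selection: keep the single feature most related to y,
# scored on the training rows only
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X_train, y_train)
selected = list(X_train.columns[selector.get_support()])

print("Train rows:", len(X_train), "| Test rows:", len(X_test))
print("Selected feature:", selected)
```

Fitting the selector on the training rows only keeps the test set untouched, so it still gives an honest check of the model later.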

# **Types of Machine Learning**



*   **Supervised Learning**


1.   You train the model with inputs plus their output labels.
2.   The model learns the mapping and then predicts the output for new input.
3.   Ex: Predicting house prices (input: size, location → output: price).
4.   Ex: A teacher gives you questions plus the correct answers to practice.
5.   In one line: "Learning with supervision (labels)."
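
As a small sketch of supervised learning, here is the house price example with made-up numbers. The sizes and prices are hypothetical, and linear regression is just one possible supervised model.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training data: house size in square feet -> price
sizes = [[500], [1000], [1500], [2000]]          # inputs
prices = [50_000, 100_000, 150_000, 200_000]     # labels (correct answers)

model = LinearRegression()
model.fit(sizes, prices)                 # learn the mapping from labeled examples

predicted = model.predict([[1200]])[0]   # predict for an unseen house
print(round(predicted))
```

Because these made-up prices follow exactly 100 per square foot, the prediction for 1200 square feet comes out at about 120,000.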


*   **Unsupervised Learning**


1.  You only provide inputs (no labels).
2.  Model finds patterns, groups, or structure in data.
3.  Ex: Customer segmentation (grouping people by buying habits).
4.  Ex: You’re given a pile of books and asked to organize them without knowing subjects.
5.  In one line: "Learning without a teacher."
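
A rough sketch of customer segmentation, assuming K-Means as the clustering method (one of many) and a handful of invented customers:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [visits per month, average spend]
customers = np.array([
    [2, 10], [3, 12], [2, 11],       # occasional, low spend
    [20, 95], [22, 100], [21, 98],   # frequent, high spend
])

# No labels are given; KMeans groups similar rows on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
groups = kmeans.fit_predict(customers)

print(groups)
```

The first three customers end up in one group and the last three in the other. Which group is called 0 and which is called 1 is arbitrary, since the model was never told what the groups mean.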


*   **Reinforcement Learning**


1.   The model (agent) learns by trial and error with rewards and penalties.
2.   The goal is to maximize total reward by learning the best actions.
3.   Ex: A dog learns tricks: when it does well, you give it a treat.
4.   In one line: "Learning from feedback."
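
To make trial and error concrete, here is a toy sketch using an epsilon-greedy strategy on a two-action problem. The actions, treat probabilities, and the value of `epsilon` are all invented for illustration, and real reinforcement learning problems are usually far richer than this.

```python
import random

random.seed(0)

# Hypothetical setup: two "tricks"; each earns a treat (reward 1)
# with a different, hidden probability
treat_probability = {"sit": 0.2, "roll": 0.8}

estimates = {action: 0.0 for action in treat_probability}  # learned value
counts = {action: 0 for action in treat_probability}
epsilon = 0.1   # how often to explore a random action

for _ in range(2000):
    if random.random() < epsilon:
        action = random.choice(list(treat_probability))   # explore
    else:
        action = max(estimates, key=estimates.get)        # exploit best so far

    reward = 1 if random.random() < treat_probability[action] else 0

    # Update the running average reward for the chosen action
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = max(estimates, key=estimates.get)
print(best_action, estimates)
```

The agent is never told the probabilities. It only sees rewards, yet its estimates drift toward the true values and it ends up preferring the action that pays off more often.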








In [None]:
# Step 6: Simple Models on the Cleaned Data

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score

# Example: Predicting Streams from Rating (Linear Regression)
# Any rating still missing after cleaning is filled with 0 so the model can run
X = df[['Rating']].fillna(0)
y = df['Streams']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)

print("Linear Regression MSE:", mean_squared_error(y_test, y_pred))

# Example: Predicting if song is 'popular' (Logistic Regression)
df['Popular'] = (df['Streams'] > df['Streams'].median()).astype(int)

X = df[['Rating']].fillna(0)
y = df['Popular']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))


# **Model on Messy Data**

1. Let's try running Machine Learning without any cleaning.
2. We'll directly feed the raw dataset into Linear Regression and Logistic Regression.

In [None]:
# Week 2: Shaping Data for ML
# Demo: Using Messy Dataset Without Cleaning

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score

# --- Load messy dataset ---
# (Use the messy songs dataset you created, or one from Kaggle)
df = pd.read_csv("messy_songs_dataset.csv")

print("First 5 Rows of Messy Data:\n", df.head())
print("\nDataset Info (Messy):\n")
df.info()

# --- Try Linear Regression ---
# Predict Streams from Rating (without cleaning)
X = df[['Rating']]   # contains messy values like '4.5*'
y = df['Streams']

# Coerce unparseable values to NaN and fill them with 0 (just to force the model to run)
X = pd.to_numeric(X['Rating'], errors='coerce').fillna(0).values.reshape(-1,1)
y = pd.to_numeric(y, errors='coerce').fillna(0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)

print("\nLinear Regression (Messy Data) MSE:", mean_squared_error(y_test, y_pred))

# --- Try Logistic Regression ---
# Define 'Popular' = 1 if Streams > median else 0
df['Popular'] = (pd.to_numeric(df['Streams'], errors='coerce') >
                 pd.to_numeric(df['Streams'], errors='coerce').median()).astype(int)

X = pd.to_numeric(df['Rating'], errors='coerce').fillna(0).values.reshape(-1,1)
y = df['Popular']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print("Logistic Regression (Messy Data) Accuracy:", accuracy_score(y_test, y_pred))
