
<pre>
<center><h1><b>Machine Learning</b></h1></center>

<center><h1><b>Lab - 3</b></h1></center>
<pre>  

# 📱 Lab: Scikit-Learn Fundamentals (Google Play Store)

**Objective:** Transition from manual data cleaning to automated Machine Learning preprocessing using Scikit-Learn.

**Prerequisites:**
* Ensure you have the `googleplaystore_cleaned.csv` file (from the previous lab) in this folder.

### 1. Load Preprocessed Data
**Instruction:** Load the Play Store dataset and make sure `Installs`, `Price`, and `Reviews` are numeric. The cell below reads the raw `googleplaystore.csv` and repeats the cleaning steps from the previous lab.

In [81]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

print("Scikit-learn version:", sklearn.__version__)


Scikit-learn version: 1.7.2


In [82]:
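df = pd.read_csv('googleplaystore.csv')

# CLEANING + FEATURE ENGINEERING
# Drop apps without a rating (the target) and the row whose Reviews value is the non-numeric string '3.0M'
df = df.dropna(subset=['Rating'])
df = df[df['Reviews'] != '3.0M']

df['Reviews'] = pd.to_numeric(df['Reviews'], errors='coerce')
# Strip '+' and ',' from values such as '10,000+'
df['Installs'] = df['Installs'].astype(str).str.replace('[+,]', '', regex=True).astype(float)
# regex=False: '$' must be treated as a literal character, not as the end-of-string anchor
df['Price'] = df['Price'].astype(str).str.replace('$', '', regex=False).astype(float)
df['Size'] = df['Size'].astype(str).replace('Varies with device', np.nan)
# Keep only the number; the unit suffix (M or k) is dropped, so kilobyte sizes are not rescaled
df['Size'] = df['Size'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)

print(f"Cleaned shape: {df.shape}")
print("✅ Data cleaned and numeric features ready!")
df.head()


Cleaned shape: (9366, 13)
✅ Data cleaned and numeric features ready!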


|  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000.0 | Free | 0.0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000.0 | Free | 0.0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000.0 | Free | 0.0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25.0 | 50000000.0 | Free | 0.0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100000.0 | Free | 0.0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
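
The `Size` cleaning in the cell above keeps only the digits, so a value given in kilobytes and one given in megabytes end up on the same scale. A minimal sketch of a unit-aware conversion follows. It assumes sizes end in `M` or `k` and that 1 MB = 1024 kB; `size_to_mb` is a hypothetical helper written for illustration, not part of the lab, and the sample values are made up.

```python
import numpy as np
import pandas as pd


def size_to_mb(value):
    """Convert a size string such as '19M' or '201k' to megabytes (hypothetical helper)."""
    if not isinstance(value, str):
        return np.nan
    text = value.strip()
    if text.endswith('M'):
        return float(text[:-1])
    if text.endswith('k'):
        return float(text[:-1]) / 1024  # assumption: 1 MB = 1024 kB
    return np.nan  # anything else, e.g. 'Varies with device'


# Made-up sample values in the same style as the Size column
sizes = pd.Series(['19M', '8.7M', '201k', 'Varies with device'])
sizes_mb = sizes.map(size_to_mb)
print(sizes_mb)
```

Applied to the real column, this would replace the `str.extract` step; rows that report `Varies with device` still come out as `NaN`.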


### 2. Intro to Scikit-Learn
**What is Scikit-Learn?**
It is the standard library for Machine Learning in Python. We use it for:
1.  **Preprocessing:** Scaling numbers and encoding text.
2.  **Modeling:** Training algorithms.
3.  **Evaluation:** Measuring how well a model predicts on data it has not seen.

**Task:** Import `sklearn` and check the version.

In [83]:
import sklearn

In [84]:
sklearn.__version__

'1.7.2'

In [85]:
#__version__

### 3.  Train_Test_Split
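**Concept:** We split the data so that we can detect "overfitting". The model learns from the **Train** set and is evaluated on the **Test** set, which it has never seen. A large gap between the two scores signals that the model memorized the training data instead of learning a general pattern.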

**Task:** 
1. Define `X` (Features: four numeric and three categorical columns; `Rating` and `App` are excluded) and `y` (Target: `Rating`).
2. Split the data (80% Train, 20% Test).

In [86]:
X = df[['Reviews', 'Size', 'Installs', 'Price', 'Category', 'Content Rating', 'Type']]
y = df['Rating']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")


Train shape: (7492, 7)
Test shape: (1874, 7)
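
To see what `test_size` and `random_state` do in isolation, here is a small sketch on a synthetic ten-row table (the `toy` data and variable names are made up for illustration). It checks that the split is 80/20, that the two parts do not overlap, and that the same `random_state` reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: ten rows, one feature, one target
toy = pd.DataFrame({'feature': np.arange(10), 'target': np.arange(10) * 0.5})
X_toy, y_toy = toy[['feature']], toy['target']

# Two splits with the same seed
a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(a_train.shape, a_test.shape)  # (8, 1) (2, 1)

# No row appears in both train and test
no_overlap = set(a_train.index).isdisjoint(a_test.index)
print(no_overlap)  # True

# Same random_state gives the same test rows
same_split = a_test.index.equals(b_test.index)
print(same_split)  # True
```

Fixing `random_state` matters in a lab setting because it lets everyone compare results on the same rows.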


### 4. 📏 Scaling Numerical Data (StandardScaler)
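**Concept:** `Installs` (up to millions or more) takes far larger values than `Price` or `Size`. Standardizing rescales each numeric feature to mean 0 and standard deviation 1, so that no feature dominates simply because of its units.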

**Task:** Use `StandardScaler` on the numerical columns.

In [87]:
numeric_features = ['Reviews', 'Size', 'Installs', 'Price']

scaler = StandardScaler()
X_train_num_scaled = scaler.fit_transform(X_train[numeric_features])
X_test_num_scaled = scaler.transform(X_test[numeric_features])

print("Numeric features scaled!")
print(f"Train scaled shape: {X_train_num_scaled.shape}")

Numeric features scaled!
Train scaled shape: (7492, 4)
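
As a sanity check on what `StandardScaler` computes, the sketch below reproduces it by hand on a tiny made-up array: each value becomes `(x - mean) / std`, with the mean and standard deviation learned from the training rows only. The `toy_` names are illustrative and are chosen so they do not overwrite the lab's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up column: four training values and one test value
toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_test = np.array([[5.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

# Manual z-score; StandardScaler uses the population std (ddof=0), like np.std
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(toy_train_scaled, manual))  # True
print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_test_scaled)
```

Calling `transform` (not `fit_transform`) on the test rows is what keeps information about the test set out of the scaling statistics.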


### 5. 🔠 Encoding Categorical Data
**Concept:** Models need numbers, not text like "Business" or "Teen".

**Method A: Pandas `get_dummies` (Simple)**

In [88]:
#get_dummies
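
A possible way to fill in Method A, sketched on a three-row made-up frame and not on the lab data: `pd.get_dummies` turns each category into its own 0/1 column, and `drop_first=True` drops one column per feature to avoid a redundant indicator.

```python
import pandas as pd

# Made-up rows using two of the lab's categorical column names
toy = pd.DataFrame({
    'Type': ['Free', 'Paid', 'Free'],
    'Content Rating': ['Everyone', 'Teen', 'Everyone'],
})

dummies = pd.get_dummies(
    toy,
    columns=['Type', 'Content Rating'],
    drop_first=True,  # drop the first category of each column
    dtype=int,        # 0/1 integers instead of booleans
)
print(dummies)
```

One caveat: `get_dummies` only knows the categories present in the frame it is given, so encoding the train and test sets separately can produce different columns.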

**Method B: Sklearn `OneHotEncoder` (Professional)**

In [89]:
#OneHotEncoder
#fit_transform

categorical_features = ['Category', 'Content Rating', 'Type']

ohe = OneHotEncoder(drop='first', sparse_output=False)
X_train_cat_encoded = ohe.fit_transform(X_train[categorical_features])
X_test_cat_encoded = ohe.transform(X_test[categorical_features])

print("OneHotEncoder applied!")
print(f"Train categories encoded: {X_train_cat_encoded.shape}")

OneHotEncoder applied!
Train categories encoded: (7492, 38)


### 6. 🚀 The Full Pipeline: ColumnTransformer
**Concept:** Instead of doing steps 4 and 5 manually, we wrap them in one object.

**Task:** Create a `ColumnTransformer` that scales the numerical columns and encodes the categorical columns in a single step.

In [90]:
numeric_features = ['Reviews', 'Size', 'Installs', 'Price']
categorical_features = ['Category', 'Content Rating', 'Type']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)
])

X_train_final = preprocessor.fit_transform(X_train)
X_test_final = preprocessor.transform(X_test)

print("✅ FULL PIPELINE COMPLETE!")
print(f"Final train shape: {X_train_final.shape}")
print(f"Final test shape: {X_test_final.shape}")


✅ FULL PIPELINE COMPLETE!
Final train shape: (7492, 42)
Final test shape: (1874, 42)
