# Wrangling Data with Scikit-learn

## 1) Introduction to Data Wrangling in Python

### What is Data Wrangling?
- Data wrangling = process of cleaning, structuring, and transforming raw data into a usable format for analysis.
- Data in real life is often incomplete, inconsistent, or messy.
- Goal: Make it suitable for machine learning models or analysis.
### Steps usually involved in data wrangling:

1. Collecting the data (CSV, SQL, API, Excel, etc.)
2. Cleaning (handling missing values, removing duplicates, fixing errors)
3. Transforming (scaling, normalization, encoding categorical values)
4. Feature engineering (extracting new features from existing ones)
5. Preparing final dataset for training/testing
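
The steps above can be sketched on a tiny array. This is a minimal illustration, not a full workflow: the height/weight values, the labels, and the BMI feature are all made up for the example, and duplicate removal is left out to keep it short.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: columns are [height_cm, weight_kg]; np.nan marks a missing value
raw = np.array([
    [170.0, 65.0],
    [160.0, np.nan],
    [180.0, 80.0],
    [175.0, 72.0],
    [165.0, 58.0],
    [185.0, 90.0],
])
labels = np.array([0, 0, 1, 1, 0, 1])  # made-up target

# Cleaning: replace the missing weight with the column mean
clean = SimpleImputer(strategy="mean").fit_transform(raw)

# Feature engineering: derive body-mass index from height and weight
bmi = clean[:, 1] / (clean[:, 0] / 100) ** 2
features = np.column_stack([clean, bmi])

# Preparing the final dataset: hold out 2 rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=2, random_state=0
)

# Transforming: learn the scaling from the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The split happens before scaling here so that the test rows do not influence the learned mean and standard deviation.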

##### In simple terms: Data Wrangling is like preparing ingredients before cooking a dish.

## 2) Overview of Scikit-learn and its Role in Data Science
### What is Scikit-learn?
- A popular Python library for Machine Learning.
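- Built on top of
    - NumPy,
    - SciPy.
- Works well alongside Pandas (data handling) and Matplotlib (plotting), although neither is a core dependency.
- Provides ready-to-use tools for
    - data preprocessing,
    - wrangling,
    - model building,
    - evaluation.
- Deployment is handled by other tools; a fitted model or pipeline can be saved (for example with joblib) and served elsewhere.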

### Why use Scikit-learn for Data Wrangling?
- Has utilities for
    - splitting datasets,
    - scaling,
    - encoding categorical features,
    - feature selection,
    - pipelines.
- Helps keep preprocessing consistent and reproducible, because the same fitted steps are applied to training data and to new data.
- Works seamlessly with Pandas DataFrames and NumPy arrays.
##### Think of Scikit-learn as a “Swiss Army Knife” for machine learning and preprocessing.

## 3) Playing with Datasets using Scikit-learn Utilities

- Scikit-learn ships with small toy datasets, so you can practice data wrangling and ML basics without downloading anything.

Example Datasets:
- iris → classification dataset of flower species
- digits → handwritten digits recognition dataset
- wine → wine classification dataset
- breast_cancer → medical classification dataset

In [1]:
from sklearn import datasets

##### Let's load the iris dataset

In [3]:
iris = datasets.load_iris()

##### Dataset features and target

In [37]:
print("Features shape:", iris.data.shape)
print("Target shape:", iris.target.shape)

Features shape: (150, 4)
Target shape: (150,)


##### Display feature names

In [9]:
print("Feature names:", iris.feature_names)

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


##### First 5 rows of data

In [13]:
print("Sample data:\n", iris.data[:5])

Sample data:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
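
The other toy datasets listed earlier load the same way. The sketch below loops over their loader functions and prints the feature shape and number of classes; the `loaders` and `shapes` dictionaries are just helper names for this example.

```python
from sklearn import datasets

# Each loader returns a Bunch object with .data (features) and .target (labels)
loaders = {
    "digits": datasets.load_digits,
    "wine": datasets.load_wine,
    "breast_cancer": datasets.load_breast_cancer,
}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()
    shapes[name] = bunch.data.shape
    print(name, "features:", bunch.data.shape, "classes:", len(bunch.target_names))
```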


### 1st summary
1) Data Wrangling = turning messy raw data into data that is ready for analysis.

2) Scikit-learn helps automate preprocessing & ML tasks.

3) Scikit-learn provides toy datasets to practice wrangling before working on real-world data.

# Understanding Classes in Scikit-learn

## 4) Scikit-learn API Design Principles

Scikit-learn has a consistent API design, which makes it very beginner-friendly.

### Key principles:
1) Consistency
    - Objects share a small, common set of methods:
        - .fit() → train/learn from data (every estimator has it)
        - .transform() → modify/transform data (transformers)
        - .predict() → make predictions (predictors such as classifiers and regressors)

2) Simplicity

    - Few main interfaces:
        - estimators,
        - transformers,
        - predictors,
        - pipelines.

3) Composability

    - Different parts can be combined together
    - Example: preprocessing + model inside a pipeline
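
To illustrate the consistency principle, here is a small sketch that drives two unrelated models through exactly the same calls. The choice of models and of the iris dataset is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two very different algorithms, used through the same fit/predict interface
models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]

for model in models:
    model.fit(X, y)                      # same training call for both
    predictions = model.predict(X[:3])   # same prediction call for both
    print(type(model).__name__, "->", predictions)
```

Because the method names do not change, swapping one model for another usually means changing a single line.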

**Analogy:**
- Think of Scikit-learn like LEGO blocks – each block (class) has a specific role, and you can connect them to build bigger systems.

## 5) Classes and Objects in Scikit-learn
- Classes = blueprints for models or preprocessing steps.
- Objects = instances of those classes.

Let's look at an example:

In [67]:
from sklearn.linear_model import LinearRegression 

reg = LinearRegression()
print(type(reg))

<class 'sklearn.linear_model._base.LinearRegression'>


Here LinearRegression is a *class*, and reg is the *object*.
- **Class** -> LinearRegression
- **Object** -> reg (instance of LinearRegression)
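
One class can produce many independent objects, each with its own settings. The sketch below creates two `LinearRegression` objects with different values of the `fit_intercept` parameter; the variable names are arbitrary.

```python
from sklearn.linear_model import LinearRegression

# Two objects created from the same class, configured differently
reg_default = LinearRegression()
reg_no_intercept = LinearRegression(fit_intercept=False)

# get_params() returns the settings each object was created with
print(reg_default.get_params()["fit_intercept"])       # True
print(reg_no_intercept.get_params()["fit_intercept"])  # False
print(reg_default is reg_no_intercept)                 # False: separate objects
```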

## 6) Working with Estimators, Transformers, and Pipelines
- Estimator → learns from data (fit, predict)
- Transformer → transforms data (fit, transform)
- Pipeline → chain of transformers + estimator (makes workflows easier & reproducible)

### A. Estimators
An estimator is any object in Scikit-learn that can learn from data.
- It has a .fit() method to train on data.
- If it’s a model, it often has .predict() to make predictions.
- Examples:
    - LinearRegression()
    - LogisticRegression()
    - KMeans()
    - DecisionTreeClassifier()

In [86]:
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4]])   # features
Y = np.array([2,4,6,8])              # target values

model = LinearRegression()           # Estimator
model.fit(X,Y)                       # Training

print("Prediction for 5:", model.predict([[5]]))

Prediction for 5: [10.]
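
What an estimator learns is stored on the object in attributes whose names end with an underscore. As a small sketch using the same kind of data (every target is twice the feature), the fitted slope and intercept can be inspected like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])  # y = 2 * x

model = LinearRegression()
print(hasattr(model, "coef_"))   # False: nothing has been learned yet

model.fit(X, y)
print(hasattr(model, "coef_"))   # True: fit() stores the learned slope

# The slope should be close to 2 and the intercept close to 0
print("Slope:", model.coef_, "Intercept:", model.intercept_)
```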


**Analogy** to understand **Estimators**
- Think of an estimator as a student who learns patterns from worked examples (input X and known output Y) and then answers new questions.

### B. Transformers
- A transformer is an object that transforms data (changes its representation).
- It has a .fit() method (to learn parameters)
- and a .transform() method (to apply transformation).
- Examples:
    - StandardScaler() → standardizes features to zero mean and unit variance
    - MinMaxScaler() → rescales values to a range (0 to 1 by default)
    - OneHotEncoder() → converts categories into binary indicator columns

In [92]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()     # Transformer

X = [[1.0],[2.0],[3.0],[4.0],[5.0]]
scaler.fit(X)
print("Transformed data:\n", scaler.transform(X))

Transformed data:
 [[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]
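
The values learned by `.fit()` can be inspected and reused on data the scaler has never seen. This sketch uses its own small array; `X_new` is a hypothetical unseen value.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_new = np.array([[6.0]])   # hypothetical value that was not part of fitting

scaler = StandardScaler().fit(X_train)
print("Learned mean:", scaler.mean_)
print("Learned standard deviation:", scaler.scale_)

# transform() reuses the training mean and standard deviation
scaled_new = scaler.transform(X_new)

# inverse_transform() maps scaled values back to the original units
restored = scaler.inverse_transform(scaled_new)
print(scaled_new, restored)
```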


**Analogy** to understand **Transformers**
- A transformer is like a chef who prepares raw vegetables (raw data) into a clean, usable dish (processed data).

### C. Pipelines
- A pipeline is a sequence of transformers followed by an estimator.
    - First: transformers prepare/clean the data
    - Last: estimator trains/predicts
- Examples:
    - Scale → Encode → Train Model
    - Clean Data → Feature Selection → Train Classifier

In [107]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),            # transformer
    ('classifier', LogisticRegression())     # estimator
])

print(pipe)

Pipeline(steps=[('scaler', StandardScaler()),
                ('classifier', LogisticRegression())])
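
A pipeline is used like any single estimator. As a sketch, the same two steps can be trained and evaluated on the iris dataset; the split settings (`test_size=0.25`, `random_state=0`) are arbitrary choices for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),            # transformer
    ('classifier', LogisticRegression())     # estimator
])

# fit() scales the training data, then trains the classifier on the scaled values
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test data before predicting
accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", round(accuracy, 3))
```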


**Analogy** to understand **pipelines**

- A pipeline is like an assembly line in a factory. Raw materials (data) pass through several steps (transformers) and finally become a product (predictions from estimator).

### 2nd summary
4) The Scikit-learn API is consistent: it is built around the .fit(), .transform(), and .predict() methods.
5) In Scikit-learn, classes act as blueprints while objects are their working instances.
6) An estimator learns from data, a transformer modifies data, and a pipeline links them together into a simple workflow.

# Defining Applications for Data Science

## 7) Use Cases of Scikit-learn in Data Wrangling

- Scikit-learn is not only for machine learning models but also widely used for data preparation.
- Key Use Cases:
    - Handling Missing Data → SimpleImputer, KNNImputer
    - Scaling & Normalization → StandardScaler, MinMaxScaler
    - Encoding Categorical Features → OneHotEncoder, OrdinalEncoder (LabelEncoder is intended for target labels, not input features)
    - Feature Selection → SelectKBest, VarianceThreshold
    - Splitting Data → train_test_split
    - Pipelines for Automation → chaining preprocessing + model training
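
A few of these use cases can be sketched in a handful of lines. The arrays below (ages, colors, and a feature matrix with one constant column) are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Handling missing data: fill the gap with the column mean
ages = np.array([[20.0], [np.nan], [40.0]])
imputed = SimpleImputer(strategy="mean").fit_transform(ages)
print(imputed.ravel())

# Encoding categorical features: one binary column per category
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # default output is sparse
print(encoder.categories_[0])
print(encoded)

# Feature selection: drop columns whose values never change
X = np.array([[0.0, 2.0], [0.0, 3.0], [0.0, 4.0]])
selector = VarianceThreshold()
selected = selector.fit_transform(X)
print(selector.get_support(), selected.ravel())
```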

**Analogy** to understand **Scikit-learn**
- Think of Scikit-learn like a toolbox for carpenters – it has all the essential tools to prepare raw wood (raw data) into usable furniture (ML-ready dataset).

## 8) Feature Extraction, Preprocessing, and Transformation

### 1. Feature Extraction

- Turning raw data into structured features.

- Example:
    - From text → word counts, TF-IDF, hashing trick
    - From images → pixel values, edge features

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love data science", "Data wrangling with scikit-learn"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Feature Names:", vectorizer.get_feature_names_out())
print("Vectorized Data:\n", X.toarray())

Feature Names: ['data' 'learn' 'love' 'science' 'scikit' 'with' 'wrangling']
Vectorized Data:
 [[1 0 1 1 0 0 0]
 [1 1 0 0 1 1 1]]


### 2. Preprocessing
- Preparing raw data so models can understand it.
- Examples: scaling, normalization, encoding categories.

In [2]:
from sklearn.preprocessing import StandardScaler

data = [[10], [20], [30]]
scaler = StandardScaler()
print("Scaled Data:", scaler.fit_transform(data))

Scaled Data: [[-1.22474487]
 [ 0.        ]
 [ 1.22474487]]
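
For comparison, normalization with `MinMaxScaler` maps the same three values onto the default 0 to 1 range. A short sketch:

```python
from sklearn.preprocessing import MinMaxScaler

data = [[10], [20], [30]]

# Smallest value becomes 0, largest becomes 1, the rest fall in between
scaled = MinMaxScaler().fit_transform(data)
print("Min-max scaled:", scaled.ravel())   # values 0.0, 0.5, 1.0
```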


### 3. Transformation
- Converting features into a different representation.
- Example: PolynomialFeatures → generate higher-order features.

In [12]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[2], [3], [4]])
poly = PolynomialFeatures(degree=1)
print("Transformed Data:\n", poly.fit_transform(X))

Transformed Data:
 [[1. 2.]
 [1. 3.]
 [1. 4.]]


In [9]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[2], [3], [4]])
poly = PolynomialFeatures(degree=2)
print("Transformed Data:\n", poly.fit_transform(X))

Transformed Data:
 [[ 1.  2.  4.]
 [ 1.  3.  9.]
 [ 1.  4. 16.]]


In [7]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[2], [3], [4]])
poly = PolynomialFeatures(degree=3)
print("Transformed Data:\n", poly.fit_transform(X))

Transformed Data:
 [[ 1.  2.  4.  8.]
 [ 1.  3.  9. 27.]
 [ 1.  4. 16. 64.]]


## 9) Practical Mini-Projects (Do it by Yourself)

- Scikit-learn provides powerful tools for data wrangling before ML.

- Feature extraction = converting raw input into useful features.

- Preprocessing & transformation make data clean and standardized.

- Mini-projects (text, numeric, categorical) show real-life applications of wrangling.

#### A. Text Data (NLP Example)

- Problem: Sentiment classification (positive/negative reviews).
    - Step: Use CountVectorizer or TfidfVectorizer → transform text → train a classifier.
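
A possible starting point for this mini-project is sketched below. The six reviews and their labels are invented for illustration, and a dataset this small is far too tiny to give reliable predictions; it only shows how the pieces connect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical reviews; 1 = positive, 0 = negative
reviews = [
    "great product, works well",
    "I love it, excellent quality",
    "very happy with this purchase",
    "terrible, broke after one day",
    "waste of money, very disappointed",
    "poor quality, do not buy",
]
labels = [1, 1, 1, 0, 0, 0]

# Text goes in, TF-IDF features are built, then the classifier is trained
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

new_reviews = ["excellent, works great", "disappointed, poor product"]
predictions = model.predict(new_reviews)
print(predictions)
```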
    
#### B. Numeric Data (Regression Example)
- Problem: Predict house prices.
    - Step: Scale numeric data (StandardScaler) → train regression model.
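
A minimal sketch of this mini-project follows. The houses are synthetic, and the prices are generated from a made-up linear rule so that the result is easy to check; real prices would be noisy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic houses: columns are [area_m2, bedrooms]
X = np.array([
    [50, 1], [80, 2], [120, 3], [65, 2], [150, 4], [100, 2],
], dtype=float)

# Made-up pricing rule used to generate the targets
y = 100 * X[:, 0] + 10_000 * X[:, 1] + 50_000

# Scale the numeric features, then fit a linear regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)

# The rule gives 100*90 + 10000*3 + 50000 = 89000 for this house
predicted = model.predict(np.array([[90.0, 3.0]]))
print("Predicted price:", round(float(predicted[0]), 2))
```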

#### C. Categorical Data (Classification Example)
- Problem: Predict if a customer will buy a product.
    - Step: Encode categories (OneHotEncoder) → train LogisticRegression.    
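
One way to start this mini-project is sketched below. The customer records, the column meanings, and the labels are all hypothetical. `handle_unknown="ignore"` is used so that a category not seen during training (here "tablet") does not raise an error.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical customers: columns are [membership, device]
customers = np.array([
    ["premium", "mobile"],
    ["premium", "desktop"],
    ["basic", "mobile"],
    ["basic", "desktop"],
    ["premium", "mobile"],
    ["basic", "mobile"],
])
bought = np.array([1, 1, 0, 0, 1, 0])  # made-up labels: 1 = bought

# Encode the categories, then train the classifier
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LogisticRegression())
model.fit(customers, bought)

new_customers = np.array([["premium", "desktop"], ["basic", "tablet"]])
probabilities = model.predict_proba(new_customers)  # columns: P(not buy), P(buy)
print(probabilities)
```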