## Theoretical Assignment

1. Explain Supervised vs. Unsupervised Learning

Ans. Supervised Learning:

- In supervised learning, the model is trained on labeled data: each input comes with a corresponding output label. The model learns a mapping from inputs to outputs and uses it to make predictions on new data.

Key Characteristics:

- Requires a labeled dataset.
- The model learns by minimizing the error between its predictions and the known labels.
- Used for Classification and Regression tasks.

Examples: 
- Spam detection: Classifies emails as spam or not based on labeled examples.
- Fraud detection: Identifies fraudulent transactions from historical labeled data.
- Speech recognition: Converts speech into text using labeled transcripts.
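
The supervised workflow described above can be sketched in a few lines. This is a minimal illustration, not part of the original assignment: it assumes scikit-learn is installed and uses a synthetic labeled dataset from `make_classification` in place of real emails or transactions. The choice of `LogisticRegression` as the classifier is likewise an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic labeled data (an assumption for illustration):
# X holds the inputs, y holds the known output labels (0 or 1).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Hold out 20% of the labeled examples to measure prediction quality.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the classifier on labeled training examples.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict labels for unseen inputs and compare them with the true labels.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```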

Unsupervised Learning:

- In unsupervised learning, the model is trained on unlabeled data, so it must find patterns and structure on its own, without being told the correct outputs.

Key Characteristics:

- No predefined labels; the algorithm identifies patterns or clusters in the data.
- Used for clustering, association, and anomaly detection tasks.

Examples:

- Customer segmentation: Groups customers based on their purchasing behavior.
- Topic modeling: Finds topics in a collection of text documents.
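
As a contrast with the supervised case, the following sketch groups unlabeled points into clusters. It is an illustration under stated assumptions: the data comes from scikit-learn's `make_blobs` (its generated labels are discarded), and `KMeans` with three clusters is chosen only because the synthetic data is generated with three centers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data (an assumption for illustration).
# The generated labels are thrown away: the algorithm never sees them.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# KMeans assigns each point to one of n_clusters groups based only on
# distances between points.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Number of points assigned to each discovered cluster.
cluster_sizes = np.bincount(cluster_ids)
print(cluster_sizes)
```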

## Practical Tasks

Task 1: Linear Regression for House Price Prediction

- Use the Boston Housing Dataset. 
- Train a Linear Regression Model to predict house prices.
- Deliverable: Jupyter Notebook with model performance metrics.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [9]:
housing_price_df = pd.read_csv('BostonHousing.csv')
housing_price_df.head()

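|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |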


In [10]:
housing_price_df.rename(columns={'crim' : 'CrimeRate', 'zn' : 'LargeLots', 'indus' : 'NonRetailAcres',
                                 'chas' : 'NearCharlesRiver', 'nox' : 'NOXCocentration',
                                 'rm' : 'AvgRooms', 'age' : 'OldHomesProportion', 'dis' : 'EmploymentDist',
                                 'rad' : 'HighwayExcess', 'tax' : 'PropertyTaxRate'}, inplace=True)
housing_price_df.head()

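|   | CrimeRate | LargeLots | NonRetailAcres | NearCharlesRiver | NOXCocentration | AvgRooms | OldHomesProportion | EmploymentDist | HighwayExcess | PropertyTaxRate | ptratio | b | lstat | medv |
|---|-----------|-----------|----------------|------------------|-----------------|----------|--------------------|----------------|---------------|-----------------|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |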


In [6]:
housing_price_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   CrimeRate           506 non-null    float64
 1   LargeLots           506 non-null    float64
 2   NonRetailAcres      506 non-null    float64
 3   NearCharlesRiver    506 non-null    int64  
 4   NOXCocentration     506 non-null    float64
 5   AvgRooms            501 non-null    float64
 6   OldHomesProportion  506 non-null    float64
 7   EmploymentDist      506 non-null    float64
 8   HighwayExcess       506 non-null    int64  
 9   PropertyTaxRate     506 non-null    int64  
 10  ptratio             506 non-null    float64
 11  b                   506 non-null    float64
 12  lstat               506 non-null    float64
 13  medv                506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


In [8]:
housing_price_df.isnull().sum()

CrimeRate             0
LargeLots             0
NonRetailAcres        0
NearCharlesRiver      0
NOXCocentration       0
AvgRooms              5
OldHomesProportion    0
EmploymentDist        0
HighwayExcess         0
PropertyTaxRate       0
ptratio               0
b                     0
lstat                 0
medv                  0
dtype: int64

In [13]:
# Fill the missing AvgRooms values with a random sample of the observed values.
# A single .loc assignment replaces the chained assignment df["col"][rows] = ...,
# which triggers a pandas warning and stops updating the original DataFrame
# under Copy-on-Write (the default behaviour in pandas 3.0).
missing_rooms = housing_price_df['AvgRooms'].isnull()
housing_price_df.loc[missing_rooms, 'AvgRooms'] = (
    housing_price_df['AvgRooms'].dropna().sample(missing_rooms.sum()).values
)
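
Because the replacement values are drawn at random, each run of the cell can fill in different numbers. A minimal sketch of a reproducible variant follows. `fill_missing_by_sampling` and its `seed` parameter are hypothetical names introduced here for illustration, and the toy DataFrame stands in for the housing data.

```python
import numpy as np
import pandas as pd


def fill_missing_by_sampling(df, column, seed=0):
    """Fill NaNs in `column` in place with values sampled from its observed entries.

    Hypothetical helper: samples without replacement, so it needs at least
    as many observed values as missing ones.
    """
    missing = df[column].isnull()
    n_missing = int(missing.sum())
    if n_missing == 0:
        return df
    # A fixed random_state makes the drawn values repeatable across runs.
    sampled = df[column].dropna().sample(n_missing, random_state=seed)
    # Single-step .loc assignment writes into the original DataFrame.
    df.loc[missing, column] = sampled.to_numpy()
    return df


# Toy data with two missing entries (illustrative values only).
toy_df = pd.DataFrame({"AvgRooms": [6.5, np.nan, 7.1, 6.9, np.nan]})
fill_missing_by_sampling(toy_df, "AvgRooms", seed=0)
print(toy_df["AvgRooms"].isnull().sum())
```

Passing the same `seed` gives the same filled values on every run, which makes later model metrics easier to compare.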

In [14]:
housing_price_df.isnull().sum()

CrimeRate             0
LargeLots             0
NonRetailAcres        0
NearCharlesRiver      0
NOXCocentration       0
AvgRooms              0
OldHomesProportion    0
EmploymentDist        0
HighwayExcess         0
PropertyTaxRate       0
ptratio               0
b                     0
lstat                 0
medv                  0
dtype: int64
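
With no missing values left, the remaining step of Task 1 is to fit the model and report its metrics. The following is a hedged sketch of one way to do that, not the notebook's actual code: `evaluate_linear_regression` is a hypothetical helper, and the 80/20 split and the choice of metrics are assumptions. The demo runs on a small synthetic DataFrame so that it does not depend on `BostonHousing.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_linear_regression(df, target, test_size=0.2, seed=42):
    """Fit LinearRegression on a train split and return metrics on the test split."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_test, y_pred),
    }


# Synthetic stand-in data (an assumption): a linear target plus small noise.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
demo_df["target"] = (
    3.0 * demo_df["x1"] - 2.0 * demo_df["x2"] + rng.normal(scale=0.1, size=200)
)

metrics = evaluate_linear_regression(demo_df, "target")
print(metrics)
```

In the notebook, the equivalent call would presumably be `evaluate_linear_regression(housing_price_df, 'medv')`, assuming `medv` is the prediction target and every other column is used as a feature.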