# Linear Regression
## Question 1
Make a class called LinearRegression which provides two methods: fit and predict. Try to implement it from scratch. If stuck, refer to the examples folder.

In [41]:
import numpy as np
class LinearRegression:
    def __init__(self):
        self.coefficients = None

    def fit(self, X, y):
        # Add a column of ones to X for the intercept term
        X_with_intercept = self.add_intercept(X)
        
        # Calculate coefficients using the normal equation (closed-form solution)
        self.coefficients = self.normal_equation(X_with_intercept, y)

    def predict(self, X):
        # Add a column of ones to X for the intercept term
        X_with_intercept = self.add_intercept(X)
        
        # Make predictions
        predictions = self.hypothesis(X_with_intercept)
        return predictions

    def add_intercept(self, X):
        # Add a column of ones to X for the intercept term
        intercept_column = np.ones((X.shape[0], 1))
        return np.concatenate((intercept_column, X), axis=1)

    def hypothesis(self, X):
        # Compute the hypothesis (predictions)
        return np.dot(X, self.coefficients)

    def normal_equation(self, X, y):
        # Compute coefficients using the normal equation (closed-form solution)
        return np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
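
As a quick sanity check of the closed-form approach, the sketch below fits noiseless synthetic data generated from a known intercept and known slopes. It is a standalone illustration rather than part of the assignment solution: the helper `fit_normal_equation` and the synthetic data are assumptions, and it uses `np.linalg.pinv` instead of `np.linalg.inv` so that a singular `X.T @ X` does not raise an error.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Hypothetical helper: solve the normal equation with an intercept column."""
    X_with_intercept = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
    # pinv (pseudo-inverse) also copes with a singular X^T X, unlike inv
    gram = X_with_intercept.T @ X_with_intercept
    return np.linalg.pinv(gram) @ X_with_intercept.T @ y

# Synthetic, noiseless data (an assumption made for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
true_coefficients = np.array([4.0, 2.0, -3.0])  # intercept, slope 1, slope 2
y_demo = true_coefficients[0] + X_demo @ true_coefficients[1:]

coefficients = fit_normal_equation(X_demo, y_demo)
print(coefficients)  # should be close to true_coefficients
```

Because the data contain no noise, the recovered coefficients should match the true ones up to floating-point error.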



## Question 2

Use the dataset https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction (*).
1. Read it using pandas.
2. Check for **null values**.
3. For each of the columns (except the first and last), plot the column values in the X-axis against the last column of prices in the Y-axis.
4. Remove the unwanted columns.
5. Split the dataset into train and test data. Test data size = 25% of total dataset.
6. **Normalize** the X_train and X_test using MinMaxScaler from sklearn.preprocessing.
7. Fit the training data into the model created in question 1 and predict the testing data.
8. Use **mean squared error and R<sup>2</sup>** from sklearn.metrics as evaluation criteria.
9. Fit the training data into the models of the same name provided by sklearn.linear_model and evaluate the predictions using MSE and R<sup>2</sup>.
10. Tune the hyperparameters of your models (learning rate, epochs) to achieve losses close to that of the sklearn models.

Note : (*) To solve this question, you may proceed in any of the following ways :
1. Prepare the notebook in Kaggle, download it and submit it separately with the other questions.
2. Download the dataset from Kaggle. Upload it to the session storage in Colab.
3. Use Kaggle data directly in Colab. [Refer here](https://www.kaggle.com/general/74235). For this, you need to create a Kaggle API token. Before submitting, hide or remove the API token.

In [14]:
#1
import pandas as pd
data_set=pd.read_csv("Real estate.csv")
data_set.head()

| | No | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |


In [5]:
#2
missing_values = data_set.isnull().sum()

if missing_values.any():
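    # Show only the columns that actually contain missing values
    print("Missing values per column:")
    print(missing_values[missing_values > 0])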
else:
    print("No missing values found in the Dataframe.")

No missing values found in the Dataframe.
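
Since the real dataset has no missing values, the check above never exercises its first branch. The sketch below runs the same `isnull().sum()` pattern on a tiny hand-made DataFrame (the frame `demo` and its column names are invented for illustration) to show what the per-column counts look like when values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in column "a"
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

missing_per_column = demo.isnull().sum()
if missing_per_column.any():
    # Report only the columns that contain missing values
    print(missing_per_column[missing_per_column > 0])
else:
    print("No missing values found in the DataFrame.")
```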


In [12]:
#3
import matplotlib.pyplot as plt
columns_to_plot = data_set.columns[1:-1]

for column in columns_to_plot:
    plt.figure(figsize=(8, 6))
    plt.scatter(data_set[column], data_set['Y house price of unit area'], alpha=0.5)
    plt.title(f'{column} vs Prices')
    plt.xlabel(column)
    plt.ylabel('Prices')
    plt.show()



In [16]:
#4
data_set_updated=data_set.drop(['X4 number of convenience stores'],axis=1)
data_set_updated.head()

| | No | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X5 latitude | X6 longitude | Y house price of unit area |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012.917 | 32.0 | 84.87882 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.917 | 19.5 | 306.5947 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583 | 13.3 | 561.9845 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.5 | 13.3 | 561.9845 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833 | 5.0 | 390.5684 | 24.97937 | 121.54245 | 43.1 |


In [18]:
#5
from sklearn.model_selection import train_test_split
X = data_set_updated.iloc[:, :-1] 
y = data_set_updated.iloc[:, -1] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (310, 6)
Shape of X_test: (104, 6)
Shape of y_train: (310,)
Shape of y_test: (104,)


In [19]:
#6
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)
print("Shape of X_train_normalized:",X_train_normalized.shape)
print("Shape of X_test_normalized:",X_test_normalized.shape)

Shape of X_train_normalized: (310, 6)
Shape of X_test_normalized: (104, 6)
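
To make the scaling step concrete, the sketch below reproduces by hand what min-max scaling computes: each column is mapped with `(x - min) / (max - min)`, where the minimum and maximum come from the training data only. The small arrays are made up for illustration. Note that test values outside the training range end up outside `[0, 1]`.

```python
import numpy as np

# Toy data (invented for illustration)
X_train_demo = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
X_test_demo = np.array([[2.0, 40.0]])

# Statistics are computed on the training set only
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)

X_train_scaled = (X_train_demo - col_min) / (col_max - col_min)
X_test_scaled = (X_test_demo - col_min) / (col_max - col_min)

print(X_train_scaled)  # every training column spans exactly [0, 1]
print(X_test_scaled)   # 40.0 exceeds the training maximum, so it maps above 1
```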


In [28]:
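#8 & 9
# Alias sklearn's class so it does not shadow the from-scratch LinearRegression from Question 1
from sklearn.linear_model import LinearRegression as SklearnLinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = SklearnLinearRegression()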

model.fit(X_train_normalized, y_train)
predictions = model.predict(X_test_normalized)

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)


Mean Squared Error: 74.83914472084611
R^2 Score: 0.5281818131421638
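
For reference, both metrics can be computed by hand. The sketch below uses made-up target and prediction vectors and the standard definitions: MSE is the mean of the squared residuals, and R<sup>2</sup> is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets.

```python
import numpy as np

# Toy targets and predictions (invented for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

mse_manual = np.mean((y_true - y_hat) ** 2)

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual)  # 0.375
print("R^2:", r2_manual)   # 0.925
```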


In [31]:
#10
class CustomLinearRegression:
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.coefficients = None

    def fit(self, X, y):
        X_with_intercept = self.add_intercept(X)
        self.coefficients = np.zeros(X_with_intercept.shape[1])
        for _ in range(self.epochs):
            gradient = self.compute_gradient(X_with_intercept, y)
            self.coefficients -= self.learning_rate * gradient

    def predict(self, X):
        X_with_intercept = self.add_intercept(X)
        predictions = self.hypothesis(X_with_intercept)
        return predictions

    def add_intercept(self, X):
        intercept_column = np.ones((X.shape[0], 1))
        return np.concatenate((intercept_column, X), axis=1)

    def hypothesis(self, X):
        return np.dot(X, self.coefficients)

    def compute_gradient(self, X, y):
        predictions = self.hypothesis(X)
        error = predictions - y
        gradient = np.dot(X.T, error) / len(y)
        return gradient

model_custom = CustomLinearRegression(learning_rate=0.01, epochs=1000)
model_custom.fit(X_train_normalized, y_train)
predictions_custom = model_custom.predict(X_test_normalized)
mse_custom = mean_squared_error(y_test, predictions_custom)
r2_custom = r2_score(y_test, predictions_custom)
print("Custom Linear Regression Model:")
print("Mean Squared Error:", mse_custom)
print("R^2 Score:", r2_custom)


Custom Linear Regression Model:
Mean Squared Error: 86.43626588236094
R^2 Score: 0.45506856873502355
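
The gap between the gradient descent model and the sklearn result suggests that 1000 epochs at a learning rate of 0.01 may not be enough to converge on min-max scaled features. The sketch below illustrates the tuning idea on synthetic data, not on the real estate dataset: the helper names, the data, and the tuned values `learning_rate=0.5, epochs=5000` are assumptions chosen for the demonstration, so the right values for the actual dataset still have to be found by experiment.

```python
import numpy as np

def add_intercept(X):
    return np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)

def gradient_descent_fit(X, y, learning_rate, epochs):
    """Hypothetical helper: batch gradient descent on the squared error."""
    X_b = add_intercept(X)
    w = np.zeros(X_b.shape[1])
    for _ in range(epochs):
        gradient = X_b.T @ (X_b @ w - y) / len(y)
        w -= learning_rate * gradient
    return w

def training_mse(X, y, w):
    return np.mean((add_intercept(X) @ w - y) ** 2)

# Synthetic features in [0, 1], mimicking min-max scaled inputs
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 10 + X_demo @ np.array([5.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=200)

# Least-squares reference solution
w_closed, *_ = np.linalg.lstsq(add_intercept(X_demo), y_demo, rcond=None)

mse_closed = training_mse(X_demo, y_demo, w_closed)
mse_default = training_mse(X_demo, y_demo, gradient_descent_fit(X_demo, y_demo, 0.01, 1000))
mse_tuned = training_mse(X_demo, y_demo, gradient_descent_fit(X_demo, y_demo, 0.5, 5000))

print("closed form:", mse_closed)
print("lr=0.01, epochs=1000:", mse_default)
print("lr=0.5, epochs=5000:", mse_tuned)
```

On this toy problem the larger learning rate and epoch count bring the gradient descent loss down to the least-squares loss, while the default settings stop short of it.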


# Logistic Regression
## Question 3

The breast cancer dataset is a binary classification dataset commonly used in machine learning tasks. It is available in scikit-learn (sklearn) as part of its datasets module.
Here is an explanation of the breast cancer dataset's components:

* Features (X):

 * The breast cancer dataset consists of 30 numeric features computed from digitized images of fine needle aspirates (FNA) of breast masses. For each attribute (radius, texture, smoothness, compactness, concavity, symmetry, fractal dimension, etc.), the mean, standard error, and worst (largest) value are given.

* Target (y):

 * The breast cancer dataset is a binary classification problem, and the target variable (y) represents the diagnosis of the breast mass. It contains two classes:
    * 0: Represents a malignant (cancerous) tumor.
    * 1: Represents a benign (non-cancerous) tumor.

Complete the code given below in place of the "..."

1. Load the dataset from sklearn.datasets
2. Separate out the X and Y columns.
3. Normalize the X data using MinMaxScaler or StandardScaler.
4. Create a train-test-split. Take any suitable test size.

In [33]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.DESCR)
print("Feature names:")
print(data.feature_names)
print("Target names:")
print(data.target_names)
print("Shape of feature data (X):", data.data.shape)
print("Shape of target data (y):", data.target.shape)

In [34]:
data = load_breast_cancer()
X = data.data
y = data.target
print("Shape of feature data (X):", X.shape)
print("Shape of target data (y):", y.shape)

Shape of feature data (X): (569, 30)
Shape of target data (y): (569,)


In [48]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
print("Shape of normalized feature data (X):", X_normalized.shape)
import matplotlib.pyplot as plt
feature1 = 0 
feature2 = 1  

plt.figure(figsize=(8, 6))
plt.scatter(X_normalized[:, feature1], X_normalized[:, feature2], c=y, cmap=plt.cm.coolwarm, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Scatter Plot of Normalized Feature Data')
plt.colorbar(label='Target')
plt.grid(True)
plt.show()

Shape of normalized feature data (X): (569, 30)


In [47]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42)
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
feature1 = 0 
feature2 = 1  
plt.figure(figsize=(8, 6))
plt.scatter(X_train[:, feature1], X_train[:, feature2], c=y_train, cmap=plt.cm.coolwarm, alpha=0.5, label='Training Set')
plt.scatter(X_test[:, feature1], X_test[:, feature2], c=y_test, cmap=plt.cm.coolwarm, alpha=0.5, marker='x', label='Testing Set')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Train-Test-Split of Normalized Feature Data')
plt.colorbar(label='Target')
plt.legend()
plt.grid(True)
plt.show()


Shape of X_train: (455, 30)
Shape of X_test: (114, 30)
Shape of y_train: (455,)
Shape of y_test: (114,)


5. Write code for the sigmoid function and Logistic regression.


In [72]:
import numpy as np
def sigmoid(z):
    sig = 1/(1+np.exp(-z))
    return sig
    
def sigmoid_derivative(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z))
    sig = sigmoid(z)
    return sig*(1 - sig)

class LogisticRegression:
    def __init__(self, learning_rate, epochs):
        # Initialise the hyperparameters of the model
        self.lr = learning_rate
        self.epochs = epochs

    def sigmoid(self, z):
        sig = 1/(1+np.exp(-z))
        return sig

    def sigmoid_derivative(self, z):
        # Derivative of the sigmoid: s(z) * (1 - s(z))
        sig = self.sigmoid(z)
        return sig*(1 - sig)
        
    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = y.reshape(-1, 1)
        self.weights = np.zeros((n_features,1))
        self.bias = 0
        # Implement the gradient descent algorithm
        for _ in range(self.epochs):
            z = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(z)

            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)

            self.weights -= self.lr*dw
            self.bias -= self.lr*db
    
    def predict(self, X):
        #Write the predict function
        z = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(z)   
        return np.round(y_pred)
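
One way to check the sigmoid and its derivative is to compare the analytic derivative with a central finite difference. The sketch below is a standalone check that redefines both functions so it runs on its own; the grid of test points and the step size `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    sig = sigmoid(z)
    return sig * (1 - sig)

z = np.linspace(-5, 5, 11)
eps = 1e-6

# Central difference approximation of d(sigmoid)/dz
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_abs_diff = np.max(np.abs(numeric - sigmoid_derivative(z)))

print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5 0.25
print(max_abs_diff)  # should be tiny
```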

6. Fit your model on the dataset and make predictions.
7. Compare your model with the Sklearn Logistic Regression model. Try out all the different penalties.
8. Print accuracy_score in each case using sklearn.metrics.

In [76]:
model = LogisticRegression(learning_rate=0.01, epochs=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Predictions:", predictions[:10])

Predictions: [[1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]]
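
Steps 7 and 8 can be approached along the following lines. This is a sketch under several assumptions, not the notebook's own solution: it reloads the data so that it runs on its own, fits the scaler on the training split only (the cells above scale before splitting), imports the sklearn class under the alias `SklearnLogisticRegression` to avoid a clash with the from-scratch class, and tries only the `l1` and `l2` penalties with the `liblinear` solver. Other penalty settings need a different solver and are left out here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

accuracies = {}
for penalty in ("l1", "l2"):
    # liblinear supports both the l1 and l2 penalties
    clf = SklearnLogisticRegression(penalty=penalty, solver="liblinear", max_iter=1000)
    clf.fit(X_tr, y_tr)
    accuracies[penalty] = accuracy_score(y_te, clf.predict(X_te))

print(accuracies)
```

When scoring the from-scratch model the same way, its predictions have shape `(n, 1)`, so flattening them with `predictions.ravel()` before calling `accuracy_score` keeps the shapes consistent.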


9. For the best model in each case (yours and scikit-learn), print the classification_report using sklearn.metrics.
10. For the best model in each case (yours and scikit-learn), print the confusion_matrix using sklearn.metrics.
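
A minimal sketch of steps 9 and 10 is shown below on a handful of made-up labels, so that the output is easy to verify by hand. The column-vector predictions imitate the `(n, 1)` output of the from-scratch `predict`, which is flattened before it is passed to the metrics. The class names follow the dataset description (0 = malignant, 1 = benign).

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels (invented for illustration)
y_true_demo = np.array([0, 1, 1, 0, 1])
# Column-vector predictions, flattened to 1-D integer labels
y_pred_demo = np.array([[0.0], [1.0], [0.0], [0.0], [1.0]]).ravel().astype(int)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
print(classification_report(y_true_demo, y_pred_demo, target_names=["malignant", "benign"]))
```

Here the single off-diagonal entry corresponds to one benign sample predicted as malignant.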

# KNN
## Question 4

How accurately can a K-Nearest Neighbors (KNN) model classify different types of glass based on a glass classification dataset consisting of 214 samples and 7 classes? Use the kaggle dataset "https://www.kaggle.com/datasets/uciml/glass".

Context: This is a Glass Identification Data Set from UCI. It contains 10 attributes, including the id. The response is the glass type (7 discrete values).

1. Load the data as you did in the 2nd question.
2. Extract the X and Y columns.
3. Split it into training and testing datasets.

In [77]:
#1
df=pd.read_csv("glass.csv")
df.head()

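| | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |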


In [79]:
#2
x=df.drop(columns=['Type'])
y=df['Type']

In [80]:
#3 splitting data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

4. Define Euclidean distance.
5. Build the KNN model.
6. Fit the model on the training data. (Note: you may need to convert the data from pandas DataFrames to NumPy arrays, e.g. X = np.array(X), and likewise for the other variables.)

In [81]:
def euclidean_distance(x1,x2):
    return np.sqrt(np.sum((x1-x2)**2))
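
The same distance can be computed for a whole set of points at once. The sketch below compares the loop over `euclidean_distance` with a vectorized version based on `np.linalg.norm`; the three sample points are made up so that the distances are easy to check by hand.

```python
import numpy as np

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

# Toy points (invented for illustration)
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
query = np.array([0.0, 0.0])

# One call per point, as in the KNN class
loop_distances = np.array([euclidean_distance(query, p) for p in points])

# Broadcasting subtracts the query from every row, then norms are taken row-wise
vectorized_distances = np.linalg.norm(points - query, axis=1)

print(loop_distances)        # [ 0.  5. 10.]
print(vectorized_distances)  # [ 0.  5. 10.]
```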

In [83]:
from collections import Counter  # needed by KNN._helper for the majority vote

class KNN(object):
    def __init__(self,k):
        self.k=k
    def fit(self,x_train,y_train):
        self.x_train=x_train
        self.y_train=y_train
    def predict(self,x_test):
        predictions=[self._helper(x) for x in x_test]
        return np.array(predictions)
    def _helper(self,x):
        prediction=[euclidean_distance(x,x1) for x1 in self.x_train]
        indices= np.argsort(prediction)[:self.k]
        labels= [self.y_train[i] for i in indices]
        c=Counter(labels).most_common()
        return c[0][0]

def accuracy(predictions,y_test):
    return np.sum(predictions==y_test)/len(y_test)

In [85]:
X_train=np.array(X_train)
X_test=np.array(X_test)
y_train=np.array(y_train)
y_test=np.array(y_test)

In [87]:
from collections import Counter
clf=KNN(k=3)
clf.fit(X_train,y_train)
predictions=clf.predict(X_test)
print(accuracy(predictions,y_test))

0.7441860465116279


In [90]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train, y_train)
Y_pred = knn_classifier.predict(X_test)
# Use a separate name so the accuracy() helper defined above is not overwritten
sklearn_accuracy = accuracy_score(y_test, Y_pred)
print("Accuracy of KNN model (using scikit-learn):", sklearn_accuracy)

Accuracy of KNN model (using scikit-learn): 0.6511627906976745


7. Make predictions. Find their accuracy using accuracy_score. Try different k values. k=3 worked well in our case.
8. Compare with the sklearn model (from sklearn.neighbors import KNeighborsClassifier)
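
Trying several values of k can be organised as a simple loop. The sketch below does this on synthetic, well-separated clusters rather than on the glass data: the compact function `knn_predict`, the cluster centres, and the split sizes are all assumptions made for illustration. On the real dataset the same loop would run over `X_train`, `y_train`, `X_test` and `y_test`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k):
    """Hypothetical compact KNN: majority vote among the k nearest training points."""
    predictions = []
    for x in X_test:
        distances = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(distances)[:k]
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Three synthetic clusters of 40 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.concatenate([c + rng.normal(size=(40, 2)) for c in centers])
y_demo = np.repeat([0, 1, 2], 40)

# Shuffle, then use 90 points for training and 30 for testing
order = rng.permutation(len(y_demo))
train_idx, test_idx = order[:90], order[90:]

accuracy_by_k = {}
for k in (1, 3, 5, 7):
    preds = knn_predict(X_demo[train_idx], y_demo[train_idx], X_demo[test_idx], k)
    accuracy_by_k[k] = float(np.mean(preds == y_demo[test_idx]))

print(accuracy_by_k)
```

Note that the from-scratch run above uses k=3 while `KNeighborsClassifier()` defaults to `n_neighbors=5`, so comparing the two at the same k gives a fairer picture.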