# Linear Regression

Linear regression analysis is used to predict the value of a variable based on the value of one or more other variables. The variable you want to predict is called the dependent variable. The variables you use to predict it are called the independent variables.
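
As a minimal sketch of the idea, the snippet below fits a straight line to a handful of made-up points with `np.polyfit` and uses it to predict a new value. The data and the variable names (`x`, `y`, `x_new`) are hypothetical and are not taken from the assignment's dataset.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Least-squares fit of a degree-1 polynomial: y ~= slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Predict the dependent variable for a new value of the independent variable
x_new = 5.0
y_hat = slope * x_new + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction={y_hat:.3f}")
```

Here `x` plays the role of the independent variable and `y` the dependent variable.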

This week, your task is to run multiple linear regression on batsmen's salaries. You'll use the average runs scored per game and the strike rate as independent variables, and the salary as the dependent variable. Additionally, you'll categorize the data by year.
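
One possible way to combine the two requirements (a regression on two predictors, categorized by year) is to fit a separate model per year. The sketch below does this on synthetic data with `np.linalg.lstsq`. The column names `Year`, `Ave`, `StrRate`, and `Final Price` match the dataset used below, but the values, the `toy` frame, and the `fit_group` helper are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are made up for illustration)
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Year": rng.choice([2008, 2009, 2010], size=n),
    "Ave": rng.uniform(10, 50, size=n),
    "StrRate": rng.uniform(90, 160, size=n),
})
toy["Final Price"] = (
    1e6 + 2e5 * toy["Ave"] + 5e4 * toy["StrRate"] + rng.normal(0, 1e5, size=n)
)

def fit_group(group):
    """Least-squares fit of Final Price on Ave and StrRate with an intercept."""
    A = np.column_stack([np.ones(len(group)), group["Ave"], group["StrRate"]])
    coef, *_ = np.linalg.lstsq(A, group["Final Price"].to_numpy(), rcond=None)
    return pd.Series(coef, index=["const", "Ave", "StrRate"])

# One row of coefficients per year
per_year = pd.DataFrame({year: fit_group(g) for year, g in toy.groupby("Year")}).T
print(per_year)
```

Fitting per group is only one reading of "categorizing by year". Adding year indicator columns to a single model would be another.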

The dataset is Data_Mendeley.csv given on GitHub. Feel free to create any new functions required.

In [5]:
# Import the required libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

Preparing data

In [2]:
df = pd.read_csv("C:\\Users\\medha\\Downloads\\Data_Mendeley.csv")
df.head()

Unnamed: 0,Id,Name,Year,Final Price,Role,Nationality,Team,Ent,Age,Matches,...,LWkts,Econ,LEcon,FourWkts,LFourWkts,FiveWkts,LFiveWkts,Indian,Specialist,Status
0,1,AB de Villiers,2008,12048000,Wicketkeeper batsman,South African,DD,0,24,6,...,0,0.0,0.0,0,0,0,0,0,1,218
1,1,AB de Villiers,2009,14736000,Wicketkeeper batsman,South African,DD,0,25,15,...,0,0.0,0.0,0,0,0,0,0,1,329
2,1,AB de Villiers,2010,13887000,Wicketkeeper batsman,South African,DD,0,26,7,...,0,0.0,0.0,0,0,0,0,0,1,885
3,1,AB de Villiers,2011,50600000,Wicketkeeper batsman,South African,RCB,0,27,16,...,0,0.0,0.0,0,0,0,0,0,1,1316
4,1,AB de Villiers,2012,55297000,Wicketkeeper batsman,South African,RCB,0,28,16,...,0,0.0,0.0,0,0,0,0,0,1,1075


In [3]:
# Independent variables: average runs (Ave) and strike rate (StrRate)
X = df[['Ave', 'StrRate']]
print(X.head())

     Ave  StrRate
0  19.00    96.93
1  51.66   130.98
2  15.85    93.27
3  34.66   128.39
4  39.87   161.11


In [6]:
# Dependent variable: salary (Final Price)
y = df[['Final Price']]
print(y.head())

   Final Price
0     12048000
1     14736000
2     13887000
3     50600000
4     55297000


In [14]:
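# Add an intercept column, since sm.OLS does not include one by default
X = sm.add_constant(X)
# Fit ordinary least squares with Final Price as the dependent variable
model = sm.OLS(y, X).fit()
print(model.summary())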

                            OLS Regression Results                            
Dep. Variable:            Final Price   R-squared:                       0.130
Model:                            OLS   Adj. R-squared:                  0.128
Method:                 Least Squares   F-statistic:                     82.21
Date:                Sat, 30 Dec 2023   Prob (F-statistic):           5.17e-34
Time:                        19:30:39   Log-Likelihood:                -20572.
No. Observations:                1108   AIC:                         4.115e+04
Df Residuals:                    1105   BIC:                         4.116e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.199e+07    1.6e+06      7.476      0.0
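
The summary reports an R-squared of 0.130, meaning the two predictors account for about 13% of the variance in `Final Price`. As a rough sketch of what that number measures, the hypothetical helper `r_squared` below computes 1 - SS_res / SS_tot directly. The sample values are made up and do not come from the dataset.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical values, not taken from the dataset
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, y_true)        # predictions equal to the targets
baseline = r_squared(y_true, [6.0] * 4)    # always predicting the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0 and a perfect fit scores 1, so 0.130 sits close to the mean-only baseline.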

Mean Squared Error

In [29]:
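# Use a Series so the subtraction aligns with the predicted values
y = df['Final Price']
y_pred = model.predict(X)
# Squared residual for each row
df['e'] = (y - y_pred)**2
print(df['e'].head())
# MSE is the mean of the squared residuals
mse = df['e'].mean()
print("Mean Square Error: ", mse)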

0    2.195711e+14
1    1.051598e+15
2    1.211368e+14
3    1.795243e+14
4    1.896409e+14
Name: e, dtype: float64
Mean Square Error:  784150813120748.9
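
The MSE above is expressed in squared salary units, which makes it hard to read. Taking its square root gives the root mean squared error (RMSE), about 2.8e7 here, in the same units as `Final Price`. The helper below is a small sketch of that calculation. `mse_and_rmse` and the sample numbers are hypothetical.

```python
import numpy as np

def mse_and_rmse(y_true, y_pred):
    """Return the mean squared error and its square root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return mse, np.sqrt(mse)

# Hypothetical targets and predictions, for illustration only
mse, rmse = mse_and_rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(mse, rmse)
```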


# Logistic Regression

Logistic regression models the probability of a discrete outcome given one or more input variables. The most common form models a binary outcome: something that can take two values, such as true/false or yes/no.
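
To make "modeling the probability" concrete: a binary logistic model passes a linear score through the sigmoid function to get a value between 0 and 1. The sketch below assumes made-up weights `w`, bias `b`, and input `x`. In practice a library such as sklearn learns the weights from data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature model
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

p = sigmoid(w @ x + b)    # estimated probability of the positive class
label = int(p >= 0.5)     # threshold at 0.5 to get a binary prediction
print(p, label)
```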

This week you will run logistic regression on the breast cancer dataset using the scikit-learn (sklearn) library. Feel free to create any new functions required.

In [35]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [36]:
# Load the breast cancer dataset
breast_cancer = datasets.load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target

In [37]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [38]:
# Standardize the features using StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
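
The scaler is fitted on the training split only, and the same mean and standard deviation are then reused for the test split, so no information from the test data leaks into preprocessing. A hand-rolled NumPy sketch of that idea, using small made-up arrays (`train`, `test`) rather than the real features:

```python
import numpy as np

# Hypothetical feature matrices (rows are samples, columns are features)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse training statistics on the test split
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test columns generally do not, which is expected.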

In [40]:
# Instantiate the logistic regression model
model = LogisticRegression(random_state=1234)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.956140350877193


In [31]:
def BCELoss(y, y_pred):
    # Ensure inputs are numpy arrays to handle both lists and numpy arrays
    y = np.array(y)
    y_pred = np.array(y_pred)

    # Clip predictions to avoid log(0) issues
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Calculate binary cross-entropy loss
    loss = -(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)).mean()

    return loss

In [34]:
# Note: y_pred holds hard 0/1 labels from model.predict, not probabilities
print(BCELoss(y_test, y_pred))

1.5148936838746005
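
The value 1.51 is what the loss returns for hard 0/1 labels: each misclassified sample is clipped to `epsilon` and contributes -log(1e-15), roughly 34.5, so the result mostly reflects the clipping constant and the error rate. Binary cross-entropy is normally computed on predicted probabilities. The sketch below rebuilds the same split and model so that it runs on its own, then evaluates both variants. `bce_loss` is a hypothetical restatement of `BCELoss`, and using `predict_proba` here is a suggested alternative, not what the cells above do.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def bce_loss(y, p, epsilon=1e-15):
    """Binary cross-entropy between 0/1 labels y and probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), epsilon, 1 - epsilon)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

# Same data, split, and model settings as the cells above
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
model = LogisticRegression(random_state=1234).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1
proba = model.predict_proba(X_test)[:, 1]
loss_from_proba = bce_loss(y_test, proba)
loss_from_labels = bce_loss(y_test, model.predict(X_test))
print("BCE from probabilities:", loss_from_proba)
print("BCE from hard labels:  ", loss_from_labels)
```

The probability-based loss rewards confident correct predictions and penalizes confident wrong ones in proportion, which the hard-label version cannot do.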
