
In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [2]:

import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, plot_roc_curve, accuracy_score
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-dark')

In [3]:
df = pd.read_csv('/content/drive/MyDrive/1Major project/injuries_2010-2020.csv')

This dataset contains NBA players' injuries between 2010 and 2020.

The first injury record is dated October 2010 and the last one October 2020.

To find the injury history of the active NBA players (2021-22 season), we'll import the active players dataset and merge the two.

In [4]:
df.tail()

Unnamed: 0,Date,Team,Acquired,Relinquished,Notes
27100,30-09-2020,Lakers,Dion Waiters,,activated from IL
27101,02-10-2020,Heat,,Bam Adebayo,strained neck (DTD)
27102,02-10-2020,Heat,,Goran Dragic,placed on IL with torn plantar fascia in left ...
27103,02-10-2020,Heat,Chris Silva,,activated from IL
27104,06-10-2020,Heat,Bam Adebayo,,returned to lineup


In [5]:
df.head()

Unnamed: 0,Date,Team,Acquired,Relinquished,Notes
0,03-10-2010,Bulls,,Carlos Boozer,fractured bone in right pinky finger (out inde...
1,06-10-2010,Pistons,,Jonas Jerebko,torn right Achilles tendon (out indefinitely)
2,06-10-2010,Pistons,,Terrico White,broken fifth metatarsal in right foot (out ind...
3,08-10-2010,Blazers,,Jeff Ayres,torn ACL in right knee (out indefinitely)
4,08-10-2010,Nets,,Troy Murphy,strained lower back (out indefinitely)


In [6]:
df.isnull().sum()

Date                0
Team                2
Acquired        17563
Relinquished     9545
Notes               0
dtype: int64

We will combine the Acquired and Relinquished columns into a single "Name" column, then drop the two original columns.

In [7]:
df['Name'] = df[["Acquired","Relinquished"]].fillna('').sum(axis=1)
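
As a rough illustration of what this step does, the sketch below rebuilds the name column on a three-row toy frame (the rows are copied from the `df.tail()` output above; `toy`, `name_concat` and `name_first` are names made up here). It assumes each row has at most one of the two columns filled, in which case `combine_first` gives the same result and may read more directly.

```python
import pandas as pd

# Toy rows taken from the tail of the injuries table.
toy = pd.DataFrame({
    "Acquired": ["Dion Waiters", None, None],
    "Relinquished": [None, "Bam Adebayo", "Goran Dragic"],
})

# Same idea as the notebook cell: blank out missing values, then concatenate.
name_concat = toy["Acquired"].fillna("") + toy["Relinquished"].fillna("")

# Alternative: take Acquired where present, otherwise fall back to Relinquished.
name_first = toy["Acquired"].combine_first(toy["Relinquished"])

print(name_concat.tolist())
print(name_first.tolist())
```

Note that once the two columns are dropped, the only remaining hint of whether a row is an injury or a return is the text in `Notes`.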

In [8]:
df = df.drop(columns=['Acquired', 'Relinquished'])

df.head()

Unnamed: 0,Date,Team,Notes,Name
0,03-10-2010,Bulls,fractured bone in right pinky finger (out inde...,Carlos Boozer
1,06-10-2010,Pistons,torn right Achilles tendon (out indefinitely),Jonas Jerebko
2,06-10-2010,Pistons,broken fifth metatarsal in right foot (out ind...,Terrico White
3,08-10-2010,Blazers,torn ACL in right knee (out indefinitely),Jeff Ayres
4,08-10-2010,Nets,strained lower back (out indefinitely),Troy Murphy


In [9]:
df.isnull().sum()

Date     0
Team     2
Notes    0
Name     0
dtype: int64

In [10]:
df["Team"].value_counts()

Spurs           1163
Bucks           1068
Warriors        1060
Rockets         1058
Raptors         1044
Celtics         1040
Nets            1024
Heat            1023
Cavaliers       1001
Mavericks        992
Hawks            975
Nuggets          966
Lakers           959
Knicks           943
76ers            910
Wizards          875
Grizzlies        875
Timberwolves     860
Jazz             841
Magic            834
Pacers           831
Bulls            791
Suns             733
Kings            728
Hornets          719
Clippers         718
Thunder          717
Pistons          714
Blazers          695
Pelicans         576
Bobcats          369
Bullets            1
Name: Team, dtype: int64

Changing the team names to full names.

In [11]:
# Map short team names to full names (avoid calling the variable `dict`,
# which would shadow the Python built-in).
team_names = {"Rockets":"Houston Rockets",
        "Magic":"Orlando Magic",
        "Nets":"Brooklyn Nets",
        "76ers":"Philadelphia Sixers",
        "Cavaliers":"Cleveland Cavaliers",
        "Kings":"Sacramento Kings",
        "Pacers":"Indiana Pacers",
        "Bucks":"Milwaukee Bucks",
        "Celtics":"Boston Celtics",
        "Pelicans":"New Orleans Pelicans",
        "Clippers":"Los Angeles Clippers",
        "Nuggets":"Denver Nuggets",
        "Wizards":"Washington Wizards",
        "Bullets":"Washington Bullets",
        "Thunder":"Oklahoma City Thunder",
        "Raptors":"Toronto Raptors",
        "Bulls":"Chicago Bulls",
        "Lakers":"Los Angeles Lakers",
        "Grizzlies":"Memphis Grizzlies",
        "Hawks":"Atlanta Hawks",
        "Heat":"Miami Heat",
        "Spurs":"San Antonio Spurs",
        "Mavericks":"Dallas Mavericks",
        "Jazz":"Utah Jazz",
        "Hornets":"Charlotte Hornets",
        "Bobcats":"Charlotte Bobcats",
        "Pistons":"Detroit Pistons",
        "Warriors":"Golden State Warriors",
        "Timberwolves":"Minnesota Timberwolves",
        "Suns":"Phoenix Suns",
        "Knicks":"New York Knicks",
        "Blazers":"Portland Trailblazers"
}

In [12]:
# Assign the result back instead of using inplace=True on a column selection.
df["Team"] = df["Team"].replace(team_names)

df.tail(10)

Unnamed: 0,Date,Team,Notes,Name
27095,12-09-2020,Houston Rockets,placed on IL with neck spasms (out for season),Tyson Chandler
27096,18-09-2020,Boston Celtics,placed on IL with strained right adductor (out...,Romeo Langford
27097,22-09-2020,Boston Celtics,surgery on right wrist (out for season),Romeo Langford
27098,23-09-2020,Miami Heat,sore right knee (DTD),Gabe Vincent
27099,30-09-2020,Miami Heat,strained left shoulder (DTD),Bam Adebayo
27100,30-09-2020,Los Angeles Lakers,activated from IL,Dion Waiters
27101,02-10-2020,Miami Heat,strained neck (DTD),Bam Adebayo
27102,02-10-2020,Miami Heat,placed on IL with torn plantar fascia in left ...,Goran Dragic
27103,02-10-2020,Miami Heat,activated from IL,Chris Silva
27104,06-10-2020,Miami Heat,returned to lineup,Bam Adebayo


In [13]:
df["Team"].value_counts()

San Antonio Spurs         1163
Milwaukee Bucks           1068
Golden State Warriors     1060
Houston Rockets           1058
Toronto Raptors           1044
Boston Celtics            1040
Brooklyn Nets             1024
Miami Heat                1023
Cleveland Cavaliers       1001
Dallas Mavericks           992
Atlanta Hawks              975
Denver Nuggets             966
Los Angeles Lakers         959
New York Knicks            943
Philadelphia Sixers        910
Washington Wizards         875
Memphis Grizzlies          875
Minnesota Timberwolves     860
Utah Jazz                  841
Orlando Magic              834
Indiana Pacers             831
Chicago Bulls              791
Phoenix Suns               733
Sacramento Kings           728
Charlotte Hornets          719
Los Angeles Clippers       718
Oklahoma City Thunder      717
Detroit Pistons            714
Portland Trailblazers      695
New Orleans Pelicans       576
Charlotte Bobcats          369
Washington Bullets           1
Name: Team, dtype: int64

Changing the order of the columns

In [14]:
df = df[["Name", "Team", "Date", "Notes"]]

df.tail()

Unnamed: 0,Name,Team,Date,Notes
27100,Dion Waiters,Los Angeles Lakers,30-09-2020,activated from IL
27101,Bam Adebayo,Miami Heat,02-10-2020,strained neck (DTD)
27102,Goran Dragic,Miami Heat,02-10-2020,placed on IL with torn plantar fascia in left ...
27103,Chris Silva,Miami Heat,02-10-2020,activated from IL
27104,Bam Adebayo,Miami Heat,06-10-2020,returned to lineup


In [15]:
df.dtypes

Name     object
Team     object
Date     object
Notes    object
dtype: object

Converting the "Date" column to datetime so that the dates can be reformatted. The source dates are day-first (DD-MM-YYYY).

In [16]:
df['Date'] = pd.to_datetime(df.Date)

df['Date'] = df['Date'].dt.strftime('%d/%m/%Y')

df.tail()

Unnamed: 0,Name,Team,Date,Notes
27100,Dion Waiters,Los Angeles Lakers,30/09/2020,activated from IL
27101,Bam Adebayo,Miami Heat,10/02/2020,strained neck (DTD)
27102,Goran Dragic,Miami Heat,10/02/2020,placed on IL with torn plantar fascia in left ...
27103,Chris Silva,Miami Heat,10/02/2020,activated from IL
27104,Bam Adebayo,Miami Heat,10/06/2020,returned to lineup
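
In the output above, rows that were 02-10-2020 and 06-10-2020 in the raw file come back as 10/02/2020 and 10/06/2020, which suggests the ambiguous values were read month-first. A minimal sketch of a day-first parse, assuming every source string follows the DD-MM-YYYY pattern seen in `df.head()` and `df.tail()` (`raw`, `parsed` and `formatted` are illustrative names):

```python
import pandas as pd

# Three date strings copied from the tail of the injuries table.
raw = pd.Series(["30-09-2020", "02-10-2020", "06-10-2020"])

# An explicit format removes the day/month ambiguity.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

# Reformat as DD/MM/YYYY, as the notebook does.
formatted = parsed.dt.strftime("%d/%m/%Y")
print(formatted.tolist())
```

Keeping the `parsed` datetime column around (rather than only the formatted strings) would also allow chronological sorting later on.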


In [17]:
df.dtypes

Name     object
Team     object
Date     object
Notes    object
dtype: object

Importing the active players dataset for the 2021-22 season.

This dataset includes Name, Team, Position, Age, Height, Weight, College and Salary information of the 2021-22 season NBA players.

We will only use Name, Team and Position information for now.

In [18]:
df_act = pd.read_csv("/content/drive/MyDrive/1Major project/active_players_2.csv")

#df_act = df_act[["Name","Team","Position"]]

df_act.head()

Unnamed: 0,Name,Team,Position,Age,Height,Height_i,Weight,College,Salary
0,Juhann Begarin,Boston Celtics,SG,19,"6' 5""",6.5,185,,
1,Jaylen Brown,Boston Celtics,SG,24,"6' 6""",6.6,223,California,26758928.0
2,Kris Dunn,Boston Celtics,PG,27,"6' 3""",6.3,205,Providence,5005350.0
3,Carsen Edwards,Boston Celtics,PG,23,"5' 11""",5.11,200,Purdue,1782621.0
4,Tacko Fall,Boston Celtics,C,25,"7' 5""",7.5,311,UCF,


In [19]:
df_act = df_act[["Name","Team","Position"]]

df_act.head()

Unnamed: 0,Name,Team,Position
0,Juhann Begarin,Boston Celtics,SG
1,Jaylen Brown,Boston Celtics,SG
2,Kris Dunn,Boston Celtics,PG
3,Carsen Edwards,Boston Celtics,PG
4,Tacko Fall,Boston Celtics,C


Merging the datasets with a left merge on Name, then keeping only the rows that have an injury record (non-null Date).

In [20]:
result_df = pd.merge(df_act, df, how= "left", on=["Name"])
result_df = result_df[result_df['Date'].notna()]

result_df.head()

Unnamed: 0,Name,Team_x,Position,Team_y,Date,Notes
1,Jaylen Brown,Boston Celtics,SG,Boston Celtics,01/11/2017,sprained right ankle (DTD)
2,Jaylen Brown,Boston Celtics,SG,Boston Celtics,13/01/2017,returned to lineup
3,Jaylen Brown,Boston Celtics,SG,Boston Celtics,13/02/2017,placed on IL with strained right hip flexor
4,Jaylen Brown,Boston Celtics,SG,Boston Celtics,24/02/2017,activated from IL
5,Jaylen Brown,Boston Celtics,SG,Boston Celtics,12/06/2017,placed on IL with right eye inflammation


In [21]:
result_df.tail()

Unnamed: 0,Name,Team_x,Position,Team_y,Date,Notes
9489,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,13/01/2020,returned to lineup
9490,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,02/07/2020,bruised left leg (DTD)
9491,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,02/11/2020,returned to lineup
9492,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,08/08/2020,placed on IL with strained left hip
9493,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,08/11/2020,activated from IL


9,280 injury records of active players are available (the index runs up to 9493 because unmatched rows were dropped).

Looking for a specific player's injury records for the last 10 years

## Checking for null entries/values

In [22]:
result_df
#result_df.isnull().sum()

Unnamed: 0,Name,Team_x,Position,Team_y,Date,Notes
1,Jaylen Brown,Boston Celtics,SG,Boston Celtics,01/11/2017,sprained right ankle (DTD)
2,Jaylen Brown,Boston Celtics,SG,Boston Celtics,13/01/2017,returned to lineup
3,Jaylen Brown,Boston Celtics,SG,Boston Celtics,13/02/2017,placed on IL with strained right hip flexor
4,Jaylen Brown,Boston Celtics,SG,Boston Celtics,24/02/2017,activated from IL
5,Jaylen Brown,Boston Celtics,SG,Boston Celtics,12/06/2017,placed on IL with right eye inflammation
...,...,...,...,...,...,...
9489,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,13/01/2020,returned to lineup
9490,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,02/07/2020,bruised left leg (DTD)
9491,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,02/11/2020,returned to lineup
9492,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,08/08/2020,placed on IL with strained left hip


In [23]:
result_df[result_df["Name"] == "Jaylen Brown"]

Unnamed: 0,Name,Team_x,Position,Team_y,Date,Notes
1,Jaylen Brown,Boston Celtics,SG,Boston Celtics,01/11/2017,sprained right ankle (DTD)
2,Jaylen Brown,Boston Celtics,SG,Boston Celtics,13/01/2017,returned to lineup
3,Jaylen Brown,Boston Celtics,SG,Boston Celtics,13/02/2017,placed on IL with strained right hip flexor
4,Jaylen Brown,Boston Celtics,SG,Boston Celtics,24/02/2017,activated from IL
5,Jaylen Brown,Boston Celtics,SG,Boston Celtics,12/06/2017,placed on IL with right eye inflammation
6,Jaylen Brown,Boston Celtics,SG,Boston Celtics,12/08/2017,activated from IL
7,Jaylen Brown,Boston Celtics,SG,Boston Celtics,21/12/2017,sore left Achilles (DTD)
8,Jaylen Brown,Boston Celtics,SG,Boston Celtics,23/12/2017,returned to lineup
9,Jaylen Brown,Boston Celtics,SG,Boston Celtics,27/12/2017,placed on IL
10,Jaylen Brown,Boston Celtics,SG,Boston Celtics,27/12/2017,sore right knee (DTD)


In [24]:
result_df.describe()

Unnamed: 0,Name,Team_x,Position,Team_y,Date,Notes
count,9280,9280,9280,9280,9280,9280
unique,344,30,7,31,1734,1552
top,Kevin Love,Los Angeles Lakers,C,Boston Celtics,04/12/2017,activated from IL
freq,139,715,1943,458,35,2480


In [25]:
result_df.isnull().sum()

Name        0
Team_x      0
Position    0
Team_y      0
Date        0
Notes       0
dtype: int64

In [26]:
result_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9280 entries, 1 to 9493
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      9280 non-null   object
 1   Team_x    9280 non-null   object
 2   Position  9280 non-null   object
 3   Team_y    9280 non-null   object
 4   Date      9280 non-null   object
 5   Notes     9280 non-null   object
dtypes: object(6)
memory usage: 507.5+ KB


## Saved file: CREATED Active Players Injury_History.csv

In [27]:
filename = 'CREATED Active Players Injury_History.csv'

result_df.to_csv(filename,index=False)

print('Saved file: ' + filename)

Saved file: CREATED Active Players Injury_History.csv


In [28]:
result_df

Unnamed: 0,Name,Team_x,Position,Team_y,Date,Notes
1,Jaylen Brown,Boston Celtics,SG,Boston Celtics,01/11/2017,sprained right ankle (DTD)
2,Jaylen Brown,Boston Celtics,SG,Boston Celtics,13/01/2017,returned to lineup
3,Jaylen Brown,Boston Celtics,SG,Boston Celtics,13/02/2017,placed on IL with strained right hip flexor
4,Jaylen Brown,Boston Celtics,SG,Boston Celtics,24/02/2017,activated from IL
5,Jaylen Brown,Boston Celtics,SG,Boston Celtics,12/06/2017,placed on IL with right eye inflammation
...,...,...,...,...,...,...
9489,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,13/01/2020,returned to lineup
9490,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,02/07/2020,bruised left leg (DTD)
9491,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,02/11/2020,returned to lineup
9492,Hassan Whiteside,Utah Jazz,C,Portland Trailblazers,08/08/2020,placed on IL with strained left hip


In [29]:
X = result_df[['Name','Team_x','Position','Team_y','Date']]   ## feature
Y = result_df['Notes']   ## target

In [30]:
#X=result_df.iloc[:,:-1].values
#Y=result_df.iloc[:,-1].values

print(X)
print(Y)

                  Name          Team_x Position                 Team_y  \
1         Jaylen Brown  Boston Celtics       SG         Boston Celtics   
2         Jaylen Brown  Boston Celtics       SG         Boston Celtics   
3         Jaylen Brown  Boston Celtics       SG         Boston Celtics   
4         Jaylen Brown  Boston Celtics       SG         Boston Celtics   
5         Jaylen Brown  Boston Celtics       SG         Boston Celtics   
...                ...             ...      ...                    ...   
9489  Hassan Whiteside       Utah Jazz        C  Portland Trailblazers   
9490  Hassan Whiteside       Utah Jazz        C  Portland Trailblazers   
9491  Hassan Whiteside       Utah Jazz        C  Portland Trailblazers   
9492  Hassan Whiteside       Utah Jazz        C  Portland Trailblazers   
9493  Hassan Whiteside       Utah Jazz        C  Portland Trailblazers   

            Date  
1     01/11/2017  
2     13/01/2017  
3     13/02/2017  
4     24/02/2017  
5     12/06/2017

## LABEL ENCODING TO CONVERT STRING TO INT

In [31]:
# Import label encoder
from sklearn import preprocessing

# LabelEncoder maps each distinct string to an integer code.
label_encoder = preprocessing.LabelEncoder()

# Encode the 'Name' column.
result_df['Name']= label_encoder.fit_transform(result_df['Name'])

result_df['Name'].unique() # unique() returns the distinct values of the column


array([154, 205, 302,  34,   3, 103, 222, 276, 139, 175,  80, 227, 156,
       271, 211,  33, 161, 249, 285, 195,  22, 146, 163, 210,   8, 147,
       327, 313, 262, 264, 141,  96,   5, 108, 303, 148, 246, 111, 253,
       177, 238,  85, 190, 288,  10, 165,  64, 314, 113, 290, 121,  20,
       232, 200, 131,  46, 284, 124, 294, 258, 119, 112, 340, 283, 217,
       316, 324,   6,  74,  54, 153,  84, 342,   2,  82, 251, 151,  60,
        68, 197, 214,  42, 269,  76,  99, 128, 159, 109, 173,  53, 322,
       275, 188,  81, 123, 254, 223, 318, 178, 160,  40, 187, 300, 136,
       101, 242, 301, 127, 122, 309, 259,  92, 120, 176, 274, 179, 273,
        32, 201, 286,  24, 158,  17,  25,  50, 126,  63, 292, 196,  72,
        37, 218,  79, 319,  21, 125, 335,  51, 189, 231, 307, 133,  28,
       257,  18, 162,  89, 329, 326, 209, 229, 330, 233, 256, 114, 239,
       333, 235, 228, 117, 167, 272,  44, 306,  29,  70, 310, 194, 293,
        61, 281, 241, 208, 116, 266, 169, 245,  47, 295,  95,  1

In [32]:
result_df['Team_x']= label_encoder.fit_transform(result_df['Team_x'])

result_df['Team_x'].unique()

array([ 1,  2, 19, 22, 27,  4,  5,  8, 11, 16,  0,  3, 15, 21, 29,  9, 12,
       13, 23, 25,  6, 10, 14, 18, 26,  7, 17, 20, 24, 28])

In [33]:
result_df['Position']= label_encoder.fit_transform(result_df['Position'])

result_df['Position'].unique()


array([6, 4, 0, 1, 3, 5, 2])

In [34]:
result_df['Team_y']= label_encoder.fit_transform(result_df['Team_y'])

result_df['Team_y'].unique()


array([ 1, 18,  5,  0, 23, 29, 21, 20, 17, 30, 26, 16, 25, 27,  9, 15,  2,
       10, 13, 11,  6, 22, 12, 28, 24,  8, 19,  4,  7, 14,  3])

In [35]:
result_df['Date']= label_encoder.fit_transform(result_df['Date'])

result_df['Date'].unique()


array([  92,  730,  736, ...,  322,  249, 1539])

In [36]:
result_df['Notes']= label_encoder.fit_transform(result_df['Notes'])

result_df['Notes'].unique()

array([1225,  989,  860, ..., 1543, 1323, 1181])
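
The cells above reuse a single `label_encoder`, and each `fit_transform` call replaces the classes it learned before, so after cell 36 only the `Notes` mapping can still be decoded. A minimal sketch that keeps one encoder per column, on a toy frame (`toy` and `encoders` are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Position": ["SG", "PG", "C", "SG"],
    "Notes": ["returned to lineup", "activated from IL",
              "returned to lineup", "strained neck (DTD)"],
})

# One encoder per column, stored so that codes can be mapped back later.
encoders = {}
for col in ["Position", "Notes"]:
    enc = LabelEncoder()
    toy[col + "_code"] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Integer codes can now be turned back into the original strings.
decoded = encoders["Notes"].inverse_transform(toy["Notes_code"])
print(list(encoders["Position"].classes_))
print(list(decoded))
```

`LabelEncoder` assigns codes in sorted order of the strings, so for the `DD/MM/YYYY` date strings the codes follow lexical order rather than chronological order.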

In [37]:
X= result_df[['Name','Team_x','Position','Date']]
#X = np.array(result_df[['Name']])  ## feature

In [38]:
X.shape # X must be 2-D: (n_samples, n_features)

(9280, 4)

In [39]:
Y = result_df['Notes']  ## target

In [40]:
Y.shape # y can be 1d or 2d 

(9280,)

In [41]:
Y.value_counts()

10      2480
989     1035
330      627
977      147
537      117
        ... 
329        1
1347       1
864        1
650        1
1181       1
Name: Notes, Length: 1552, dtype: int64
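
The counts above imply that always predicting class 10 would be correct for 2480 of 9280 rows, about 26.7%, which is a useful reference point for any model trained on this target. A small sketch of such a majority-class baseline with scikit-learn's `DummyClassifier`, on a synthetic label vector (the toy `y` below is an assumption, not the real target):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the encoded target: one dominant class and a tail.
y = np.array([10] * 6 + [989] * 2 + [330, 977])
X_dummy = np.zeros((len(y), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y)
baseline_acc = accuracy_score(y, baseline.predict(X_dummy))
print(baseline_acc)
```

A classifier whose accuracy is close to this baseline has learned little beyond the class frequencies.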

In [42]:
X.value_counts()

Name  Team_x  Position  Date
193   13      5         903     2
138   23      0         1645    2
229   15      3         1609    2
100   23      6         393     2
229   15      3         1113    2
                               ..
107   28      3         760     1
                        740     1
                        679     1
                        643     1
343   18      3         1247    1
Length: 9141, dtype: int64

In [43]:
result_df.isnull().sum() # checking null values

Name        0
Team_x      0
Position    0
Team_y      0
Date        0
Notes       0
dtype: int64

In [44]:
result_df

Unnamed: 0,Name,Team_x,Position,Team_y,Date,Notes
1,154,1,6,1,92,1225
2,154,1,6,1,730,989
3,154,1,6,1,736,860
4,154,1,6,1,1336,10
5,154,1,6,1,670,634
...,...,...,...,...,...,...
9489,130,28,0,25,733,989
9490,130,28,0,25,164,69
9491,130,28,0,25,201,989
9492,130,28,0,25,486,832


In [45]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,train_size=0.8,random_state=1)

print(X_train)
print(X_test)
#print(Y_train)
#print(Y_test)

      Name  Team_x  Position  Date
1445   190      19         4  1596
7912    73      14         6  1686
1145   108      19         2   265
502     33       2         5   265
3266   176      16         4    90
...    ...     ...       ...   ...
2956   300      11         4  1444
7977    31      18         5  1535
914    327       2         6  1589
5307   248      12         3  1143
238    139       1         3  1272

[7424 rows x 4 columns]
      Name  Team_x  Position  Date
5545   287      12         0   395
2541    42       5         5   146
4925   295       9         4  1308
6713   100      23         6    61
2563   269       5         2  1525
...    ...     ...       ...   ...
110      3       1         0   621
6679   138      23         0   502
1679   314      22         3  1034
2047   340      27         5   663
6889   129      25         5  1673

[1856 rows x 4 columns]


In [46]:
Y_train.value_counts()

10     1957
989     836
330     505
977     113
758      90
       ... 
534       1
154       1
122       1
948       1
375       1
Name: Notes, Length: 1385, dtype: int64
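
The training split contains 1385 of the 1552 distinct `Notes` codes, so at least 167 codes occur only in the test split and cannot be predicted by any model fitted on `Y_train`. A hedged sketch of how that share could be measured, using synthetic data in place of the real frames (all names below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy target: one frequent class plus many labels that occur exactly once.
y = np.concatenate([np.full(60, 10), np.arange(100, 140)])
X = rng.normal(size=(len(y), 2))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=1)

# Labels that appear in the test split but never in the training split.
unseen = np.setdiff1d(y_te, y_tr)

# Fraction of test rows carrying such a label: an upper bound on lost accuracy.
unseen_share = np.isin(y_te, unseen).mean()
print(len(unseen), round(float(unseen_share), 3))
```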

In [47]:
print (X)

      Name  Team_x  Position  Date
1      154       1         6    92
2      154       1         6   730
3      154       1         6   736
4      154       1         6  1336
5      154       1         6   670
...    ...     ...       ...   ...
9489   130      28         0   733
9490   130      28         0   164
9491   130      28         0   201
9492   130      28         0   486
9493   130      28         0   490

[9280 rows x 4 columns]


In [48]:
print (Y)

1       1225
2        989
3        860
4         10
5        634
        ... 
9489     989
9490      69
9491     989
9492     832
9493      10
Name: Notes, Length: 9280, dtype: int64


In [49]:
#### Logistic Regression

In [50]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression()
log.fit(X_train,Y_train)

LogisticRegression()

In [51]:
Y_pred = log.predict(X_test)
Y_pred

array([10, 10, 10, ..., 10, 10, 10])

In [52]:
Y_test.values

array([338, 784,  10, ..., 330,  10, 261])

In [53]:
from sklearn.metrics import accuracy_score
score = accuracy_score(Y_test,Y_pred)
score

0.2817887931034483

In [55]:
#### Change Threshold Value
Y_pred_new = log.predict_proba(X_test)
Y_pred_new

array([[1.13712091e-04, 1.14503831e-04, 1.25023726e-04, ...,
        4.49090278e-05, 8.48879140e-05, 3.26521595e-04],
       [6.71889956e-04, 6.93600108e-04, 5.86558336e-04, ...,
        4.26400047e-04, 6.71399387e-04, 7.83079581e-04],
       [1.14451766e-05, 1.55663132e-05, 2.70925878e-06, ...,
        1.67607809e-07, 1.30215187e-05, 3.32016719e-05],
       ...,
       [1.04217587e-05, 1.28889872e-05, 4.15242215e-06, ...,
        4.26995835e-07, 1.00942902e-05, 3.27072984e-05],
       [1.52335129e-05, 1.64786856e-05, 1.22574937e-05, ...,
        2.50728382e-06, 1.18385611e-05, 5.30820629e-05],
       [2.53422294e-05, 4.22555970e-05, 2.07560823e-06, ...,
        7.43701611e-08, 4.54397056e-05, 3.96615998e-05]])

In [56]:
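threshold = 0.4
# Caution: with a multiclass target, column 1 is only the probability of the
# second class in log.classes_. Thresholding it yields 0/1 flags, not Notes
# codes, so comparing them with Y_test gives the 0.0 accuracy in the next cell.
Y_pred_new = log.predict_proba(X_test)[:,1]
Y_pred_new = Y_pred_new >= threshold
Y_pred_new = Y_pred_new.astype('int')
Y_pred_new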

array([0, 0, 0, ..., 0, 0, 0])

In [57]:
from sklearn.metrics import accuracy_score
score = accuracy_score(Y_test,Y_pred_new)
score

0.0
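
Thresholding a single probability column is a binary-classification technique. For a multiclass target, one possible alternative is to take the most probable class and, if desired, flag predictions whose top probability falls below a cut-off. The sketch below uses synthetic data from `make_classification`; the class codes 10, 330 and 989 are borrowed from the outputs above purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
y = np.array([10, 330, 989])[y]  # relabel so the codes are not 0..2

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)  # one column per entry of clf.classes_

# Pick the most probable column and map it back to the label code.
pred_argmax = clf.classes_[proba.argmax(axis=1)]

# Optional confidence cut-off on the winning probability.
threshold = 0.4
confident = proba.max(axis=1) >= threshold
print(pred_argmax[:5], confident.mean())
```

Here `pred_argmax` matches what `clf.predict(X)` returns, and `confident` only marks which of those predictions clear the threshold.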

In [None]:
### KNN

In [72]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train,Y_train)
Y_pred_knn = knn.predict(X_test)
Y_pred_knn

array([10, 10, 10, ..., 10, 10, 10])

In [73]:
from sklearn.metrics import accuracy_score
score = accuracy_score(Y_test,Y_pred_knn)
score

0.2785560344827586
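
KNN classifies by distance, so a column with a large numeric range (here the encoded `Date`, with codes up to 1733) can dominate a column with a small range such as `Position` (0 to 6). A rough sketch of the effect on synthetic data, using the `StandardScaler` already imported in cell 2; the toy features and the label rule are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two toy features on very different scales.
position = rng.integers(0, 7, size=400)
date = rng.integers(0, 1734, size=400)
X = np.column_stack([position, date])
y = (position >= 4).astype(int)  # label depends only on the small-scale feature

X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# KNN on raw features versus KNN on standardised features.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print(raw_acc, scaled_acc)
```

Scaling does not make label-encoded categories meaningful as distances; one-hot encoding the categorical columns is another option worth trying.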

Hyperparameter Tuning with GridSearchCV
We will use GridSearchCV to try to improve the performance of these models:

1. Logistic Regression Tuning
2. XGBoost Tuning
3. KNN Tuning
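
As a generic illustration of the approach, the sketch below runs `GridSearchCV` over a small KNN grid on synthetic data. The grid values, the 3-fold setting and the pipeline step names are assumptions chosen for the example, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step>__<parameter>.
param_grid = {"knn__n_neighbors": [5, 15, 50],
              "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other estimators in the list by swapping the final pipeline step and its parameter grid.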


Logistic Regression Tuning