## 4: Predicting Chronic Kidney Disease in Patients

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus on three of these steps: exploring the data, building models, and evaluating the models we build.

There are three links you may find important:
- [A set of chronic kidney disease (CKD) data and other biological factors](./chronic_kidney_disease_full.csv).
- [The CKD data dictionary](./chronic_kidney_disease_header.txt).
- [An article comparing the use of k-nearest neighbors and support vector machines on predicting CKD](./chronic_kidney_disease.pdf).

## Step 1: Define the problem.

Suppose you're working for Mayo Clinic, widely recognized to be the top hospital in the United States. In your work, you've overheard nurses and doctors discuss test results, then arrive at a conclusion as to whether or not someone has developed a particular disease or condition. For example, you might overhear something like:

> **Nurse**: Male 57 year-old patient presents with severe chest pain. FDP _(short for fibrin degradation product)_ was elevated at 13. We did an echo _(echocardiogram)_ and it was inconclusive.

> **Doctor**: What was his interarm BP? _(blood pressure)_

> **Nurse**: Systolic was 140 on the right; 110 on the left.

> **Doctor**: Dammit, it's an aortic dissection! Get to the OR _(operating room)_ now!

> _(intense music playing)_

In this fictitious but [Shonda Rhimes-esque](https://en.wikipedia.org/wiki/Shonda_Rhimes#Grey's_Anatomy,_Private_Practice,_Scandal_and_other_projects_with_ABC) scenario, you might imagine the doctor going through a series of steps like a [flowchart](https://en.wikipedia.org/wiki/Flowchart), or a series of if-this-then-that steps to diagnose a patient. The first steps made the doctor ask what the interarm blood pressure was. Because interarm blood pressure took on the values it took on, the doctor diagnosed the patient with an aortic dissection.

Your goal, as a research biostatistical data scientist at the nation's top hospital, is to develop a medical test that can improve upon our current diagnosis system for [chronic kidney disease (CKD)](https://www.mayoclinic.org/diseases-conditions/chronic-kidney-disease/symptoms-causes/syc-20354521).

**Real-world problem**: Develop a medical diagnosis test that is better than our current diagnosis system for CKD.

**Data science problem**: Develop a medical diagnosis test that reduces both the number of false positives and the number of false negatives.

---

## Step 2: Obtain the data.

In [1]:
# Import the relevant libraries

import pandas as pd
import numpy as np

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

### 1. Read in the data.

In [2]:
df_CKD = pd.read_csv('chronic_kidney_disease_full.csv')

In [3]:
df_CKD.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,...,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,...,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,35.0,7300.0,4.6,no,no,no,good,no,no,ckd


In [4]:
df_CKD.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     391 non-null    float64
 1   bp      388 non-null    float64
 2   sg      353 non-null    float64
 3   al      354 non-null    float64
 4   su      351 non-null    float64
 5   rbc     248 non-null    object 
 6   pc      335 non-null    object 
 7   pcc     396 non-null    object 
 8   ba      396 non-null    object 
 9   bgr     356 non-null    float64
 10  bu      381 non-null    float64
 11  sc      383 non-null    float64
 12  sod     313 non-null    float64
 13  pot     312 non-null    float64
 14  hemo    348 non-null    float64
 15  pcv     329 non-null    float64
 16  wbcc    294 non-null    float64
 17  rbcc    269 non-null    float64
 18  htn     398 non-null    object 
 19  dm      398 non-null    object 
 20  cad     398 non-null    object 
 21  appet   399 non-null    object 
 22  pe

In [5]:
df_CKD.shape

(400, 25)

### 2. Check out the data dictionary. What are a few features or relationships you might be interested in checking out?

Answer:
- Checking whether there are any missing values.
- Checking the distribution of each variable.
- Checking the association between variables to evaluate whether there is a multicollinearity issue.


---

## Step 3: Explore the data.

### 3. How much of the data is missing from each column?

In [6]:
df_CKD.isnull().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64

Ans: Every column except `class` has one or more missing values.
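Raw counts are easier to compare when expressed as a share of the rows. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are hypothetical, not taken from the CKD file); the same call should work on `df_CKD`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_CKD (values are made up).
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 51.0],
    "rbc": [np.nan, np.nan, "normal", "normal"],
    "class": ["ckd", "ckd", "ckd", "notckd"],
})

# isnull() marks missing cells; mean() turns the marks into a fraction per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the columns that would cost the most rows in a complete case analysis at the top.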

### 4. Suppose that I dropped every row that contained at least one missing value. (In the context of analysis with missing data, we call this a "complete case analysis," because we keep only the complete cases!) How many rows would remain in our dataframe? What are at least two downsides to doing this?

> There's a good visual on slide 15 of [this deck](https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf) that shows what a complete case analysis looks like if you're interested.  
**Note:** You can clean your data below in step 4 when building a model!

In [7]:
df_CKD.dropna(axis=0, how='any', inplace=True)

df_CKD.shape

(158, 25)

Answer:
If we drop every row that has at least one missing value, only 158 of the 400 rows remain.

Downsides:
- Losing this much information may reduce variability: we dropped 242 of the 400 observations, which is about 61% of the data.
- Losing this much data may weaken the correlation estimates in the data.
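Before committing to a complete case analysis, it can help to measure its cost without modifying the original frame. The sketch below uses a made-up frame (`toy` is hypothetical) and compares the class balance before and after dropping incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 51.0, 60.0],
    "bp": [80.0, np.nan, 80.0, 70.0, 90.0],
    "class": ["ckd", "ckd", "ckd", "notckd", "notckd"],
})

# Without inplace=True, dropna returns a new frame and leaves `toy` untouched.
complete = toy.dropna(axis=0, how="any")

n_kept = len(complete)
share_dropped = 1 - n_kept / len(toy)

# Dropping rows can also shift the class balance, not just the row count.
class_before = toy["class"].value_counts(normalize=True)
class_after = complete["class"].value_counts(normalize=True)
print(n_kept, share_dropped)
print(class_before, class_after, sep="\n")
```

In this toy example the incomplete rows all belong to one class, so the class proportions change after dropping. Whether that happens in real data depends on how the missingness is distributed.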


### 5. Thinking critically about how our data were gathered, it's likely that these records were gathered by doctors and nurses. Brainstorm three potential areas (in addition to the missing data we've already discussed) where this data might be inaccurate or imprecise.

Answer:
- Some of the variables, such as `appet`, are qualitative, so there may be subjective bias: two doctors or nurses may rate the same thing differently.
- The many missing values might be a result of inaccurate lab measurements or inadequate calibration of the measurement apparatus.

---

## Step 4: Model the data.

### 6. Suppose that I want to construct a "model" where no person who has CKD will ever be told that they do not have CKD. What (very simple, no machine learning needed) model can I create that will never tell a person with CKD that they do not have CKD?

> Hint: Don't think about `statsmodels` or `scikit-learn` here.

Answer: Build an overly simplistic model that tells everyone that they have CKD.

### 7. In problem 6, what common classification metric did we optimize for? Did we minimize false positives or negatives?

Answer:
  We optimized for sensitivity and minimized false negatives.

### 8. Thinking ethically, what is at least one disadvantage to the model you described in problem 6?

Answer: A model that tells everyone they have CKD misdiagnoses every healthy person (a false positive). Misleading people in this way is unprofessional and unethical.

### 9. Suppose that I want to construct a "model" where a person who does not have CKD will never be told that they do have CKD. What (very simple, no machine learning needed) model can I create that will accomplish this?

Answer: Build an overly simplified model that tells everyone that they do not have CKD. The only error that could occur in this kind of model is a false negative (Type II error).

### 10. In problem 9, what common classification metric did we optimize for? Did we minimize false positives or negatives?

Answer: We optimized for specificity and minimized false positives.

### 11. Thinking ethically, what is at least one disadvantage to the model you described in problem 9?

Answer: The model tells everyone that they do not have CKD, so every person who actually has CKD receives a false negative. This may cause people who are actually sick not to seek medical treatment in time.
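The two trivial models from problems 6 and 9 can be checked numerically. The sketch below uses hypothetical labels (`y_true` is made up, with 1 meaning CKD) and a small helper, `sensitivity_specificity`, which is an illustrative name and not part of the lab.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = CKD, 0 = no CKD.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # share of true CKD cases that are caught
    specificity = tn / (tn + fp)   # share of non-CKD cases that are cleared
    return sensitivity, specificity

always_ckd = np.ones_like(y_true)    # problem 6: tell everyone they have CKD
never_ckd = np.zeros_like(y_true)    # problem 9: tell everyone they do not

print(sensitivity_specificity(y_true, always_ckd))
print(sensitivity_specificity(y_true, never_ckd))
```

Each trivial model reaches a perfect score on one metric only by scoring zero on the other, which is why neither is useful as a diagnostic test.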

### 12. Construct a logistic regression model in `sklearn` predicting class from the other variables. You may scale, select/drop, and engineer features as you wish - build a good model! Make sure, however, that you include at least one categorical/dummy feature and at least one quantitative feature.

> Hint 1: Remember to do a train/test split!  
> Hint 2: This will require data cleaning first!

In [8]:
df_CKD.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
9,53.0,90.0,1.02,2.0,0.0,abnormal,abnormal,present,notpresent,70.0,...,29.0,12100.0,3.7,yes,yes,no,poor,no,yes,ckd
11,63.0,70.0,1.01,3.0,0.0,abnormal,abnormal,present,notpresent,380.0,...,32.0,4500.0,3.8,yes,yes,no,poor,yes,no,ckd
14,68.0,80.0,1.01,3.0,2.0,normal,abnormal,present,present,157.0,...,16.0,11000.0,2.6,yes,yes,yes,poor,yes,no,ckd
20,61.0,80.0,1.015,2.0,0.0,abnormal,abnormal,notpresent,notpresent,173.0,...,24.0,9200.0,3.2,yes,yes,yes,poor,yes,yes,ckd


In [9]:
## Assign a binary indicator for the class column (1 = ckd, 0 = notckd)

df_CKD['class_id'] = [1 if i == 'ckd' else 0 for i in df_CKD['class']]

In [10]:
df_CKD.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class,class_id
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,6700.0,3.9,yes,no,no,poor,yes,yes,ckd,1
9,53.0,90.0,1.02,2.0,0.0,abnormal,abnormal,present,notpresent,70.0,...,12100.0,3.7,yes,yes,no,poor,no,yes,ckd,1
11,63.0,70.0,1.01,3.0,0.0,abnormal,abnormal,present,notpresent,380.0,...,4500.0,3.8,yes,yes,no,poor,yes,no,ckd,1
14,68.0,80.0,1.01,3.0,2.0,normal,abnormal,present,present,157.0,...,11000.0,2.6,yes,yes,yes,poor,yes,no,ckd,1
20,61.0,80.0,1.015,2.0,0.0,abnormal,abnormal,notpresent,notpresent,173.0,...,9200.0,3.2,yes,yes,yes,poor,yes,yes,ckd,1


In [11]:
df_CKD.columns

Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'class', 'class_id'],
      dtype='object')

In [12]:
df_CKD.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 158 entries, 3 to 399
Data columns (total 26 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       158 non-null    float64
 1   bp        158 non-null    float64
 2   sg        158 non-null    float64
 3   al        158 non-null    float64
 4   su        158 non-null    float64
 5   rbc       158 non-null    object 
 6   pc        158 non-null    object 
 7   pcc       158 non-null    object 
 8   ba        158 non-null    object 
 9   bgr       158 non-null    float64
 10  bu        158 non-null    float64
 11  sc        158 non-null    float64
 12  sod       158 non-null    float64
 13  pot       158 non-null    float64
 14  hemo      158 non-null    float64
 15  pcv       158 non-null    float64
 16  wbcc      158 non-null    float64
 17  rbcc      158 non-null    float64
 18  htn       158 non-null    object 
 19  dm        158 non-null    object 
 20  cad       158 non-null    object

In [13]:
#getting dummy variables
rbc_abnormal = pd.get_dummies(df_CKD['rbc'])['abnormal']

In [14]:
pc_abnormal = pd.get_dummies(df_CKD['pc'])['abnormal']

In [15]:
pcc_present = pd.get_dummies(df_CKD['pcc'])['present']

In [16]:
ba_present = pd.get_dummies(df_CKD['ba'])['present']

In [17]:
htn_yes = pd.get_dummies(df_CKD['htn'])['yes']

In [18]:
dm_yes = pd.get_dummies(df_CKD['dm'])['yes']

In [19]:
cad_yes = pd.get_dummies(df_CKD['cad'])['yes']

In [20]:
appet_poor = pd.get_dummies(df_CKD['appet'])['poor']

In [21]:
pe_yes = pd.get_dummies(df_CKD['pe'])['yes']

In [22]:
ane_yes = pd.get_dummies(df_CKD['ane'])['yes']



In [23]:
## Dataframes of numeric and qualitative(qual) columns

df_CKD_numeric = df_CKD[['age', 'bp', 'sg', 'al', 'su', 'bgr',
                 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv',
                 'wbcc', 'rbcc']]

df_CKD_qual = pd.DataFrame([ane_yes, pe_yes, rbc_abnormal, pc_abnormal,
                     pcc_present, ba_present, htn_yes, dm_yes,
                     cad_yes, appet_poor], index=['ane_yes', 'pe_yes', 'rbc_abnormal',
                                                  'pc_abnormal', 'pcc_present',
                                                  'ba_present', 'htn_yes', 'dm_yes',
                                                  'cad_yes', 'appet_poor']).T

In [24]:
## Merge the above data frames
X = df_CKD_numeric.merge(right = df_CKD_qual, left_index = True, right_index = True)

In [25]:
X.head()

Unnamed: 0,age,bp,sg,al,su,bgr,bu,sc,sod,pot,...,ane_yes,pe_yes,rbc_abnormal,pc_abnormal,pcc_present,ba_present,htn_yes,dm_yes,cad_yes,appet_poor
3,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,1,1,0,1,1,0,1,0,0,1
9,53.0,90.0,1.02,2.0,0.0,70.0,107.0,7.2,114.0,3.7,...,1,0,1,1,1,0,1,1,0,1
11,63.0,70.0,1.01,3.0,0.0,380.0,60.0,2.7,131.0,4.2,...,0,1,1,1,1,0,1,1,0,1
14,68.0,80.0,1.01,3.0,2.0,157.0,90.0,4.1,130.0,6.4,...,0,1,0,1,1,1,1,1,1,1
20,61.0,80.0,1.015,2.0,0.0,173.0,148.0,3.9,135.0,5.2,...,1,1,1,1,0,0,1,1,1,1
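The design matrix `X` above is assembled from ten separate `get_dummies` calls and a merge. As a possible alternative, the sketch below builds the same kind of 0/1 indicator columns in one loop. The frame `toy` and the mapping `positive_level` are hypothetical and only cover three of the categorical columns.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and three categorical columns.
toy = pd.DataFrame({
    "age": [48.0, 53.0, 63.0],
    "rbc": ["normal", "abnormal", "abnormal"],
    "htn": ["yes", "yes", "no"],
    "appet": ["poor", "good", "poor"],
})

# For each categorical column, the level that should be coded as 1.
positive_level = {"rbc": "abnormal", "htn": "yes", "appet": "poor"}

# Start from the numeric columns, then append one indicator per categorical column.
X_toy = toy[["age"]].copy()
for col, level in positive_level.items():
    X_toy[f"{col}_{level}"] = (toy[col] == level).astype(int)

print(X_toy)
```

Because the indicators are added to a frame that already shares the original index, no merge is needed afterwards.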


In [26]:
X_train, X_test, y_train, y_test=train_test_split(X.fillna(X.mean()),df_CKD['class_id'],test_size = 0.3,random_state = 42)

In [27]:
parameters = {'C': [0.001, 0.01, 0.1, 1, 10],
              'class_weight': [None, 'balanced'],
              'penalty': ['l1', 'l2']}

In [28]:
import random 
random.seed(42)

gs_results = GridSearchCV(estimator = LogisticRegression(random_state = 42, solver='liblinear'), # Specify the model we want to GridSearch.
                          param_grid = parameters,                           # Specify the grid of parameters we want to search.
                          scoring = 'recall',                                # Specify recall as the metric to optimize 
                          cv = 5).fit(X_train, y_train)                      # Use 5-fold cross-validation, then fit.

In [29]:
gs_results.best_estimator_

LogisticRegression(C=0.001, class_weight='balanced', penalty='l1',
                   random_state=42, solver='liblinear')

In [30]:
logit = LogisticRegression(C = 1,
                           class_weight = 'balanced',
                           penalty = 'l2',
                           random_state = 42)

In [31]:
logit.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1, class_weight='balanced', random_state=42)

In [32]:
logit.score(X_train, y_train)

1.0

In [33]:
logit.score(X_test, y_test)

0.9791666666666666
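The convergence warning raised when fitting `logit` recommends scaling the data. A minimal sketch of that suggestion follows, using synthetic data from `make_classification` in place of the CKD features (all names ending in `_demo` are hypothetical). It is not the model used in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CKD features; one column is inflated to mimic
# a feature measured on a much larger scale than the others.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_demo[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fit on the training data only, inside the pipeline,
# so no information from the test set leaks into the scaling.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1, class_weight="balanced", random_state=42),
)
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
n_iter = pipe.named_steps["logisticregression"].n_iter_[0]
print(test_accuracy, n_iter)
```

Note that scaling changes the meaning of the coefficients: each one then describes a change of one standard deviation in the feature, not one raw unit.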

---

## Step 5: Evaluate the model.

### 13. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your quantitative features.

In [34]:
list(zip(np.exp(logit.coef_[0]),X.columns))

[(1.0973572729509171, 'age'),
 (1.1960255789020713, 'bp'),
 (0.9990984136930516, 'sg'),
 (1.0921811730920479, 'al'),
 (1.0111031547054565, 'su'),
 (1.1231419852848932, 'bgr'),
 (1.251064588765894, 'bu'),
 (1.0718768174854334, 'sc'),
 (0.6784326922319207, 'sod'),
 (0.9778733098216933, 'pot'),
 (0.9199452941955547, 'hemo'),
 (0.7222053193528646, 'pcv'),
 (1.0022823400097889, 'wbcc'),
 (0.9774703708659412, 'rbcc'),
 (1.0082117779619515, 'ane_yes'),
 (1.0106178710304738, 'pe_yes'),
 (1.020095827995809, 'rbc_abnormal'),
 (1.0107337278762518, 'pc_abnormal'),
 (1.008300738671203, 'pcc_present'),
 (1.002268977791284, 'ba_present'),
 (1.0113525303052981, 'htn_yes'),
 (1.00363962697246, 'dm_yes'),
 (1.0008854272164331, 'cad_yes'),
 (1.0083326180728371, 'appet_poor')]

Ans: As age increases by one unit, the odds of having CKD are multiplied by about 1.10, all else held constant.
As bp increases by one unit, the odds of having CKD are multiplied by about 1.20, all else held constant.
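An exponentiated logistic regression coefficient is a multiplier on the odds, not on the probability. The sketch below checks this numerically on synthetic one-feature data (the data and the helper `odds` are hypothetical, not from the CKD set).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one feature, noisy binary outcome.
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 1))
y = (x[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(x, y)

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

# Predicted probability of class 1 at x = 0 and at x = 1 (a one-unit increase).
p0 = model.predict_proba([[0.0]])[0, 1]
p1 = model.predict_proba([[1.0]])[0, 1]

odds_ratio = odds(p1) / odds(p0)
print(odds_ratio, np.exp(model.coef_[0, 0]))
```

The two printed numbers agree up to floating-point error, while the ratio `p1 / p0` of the probabilities themselves generally does not equal `exp(coef)`.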



### 14. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your categorical/dummy features.

Answer: If someone's appetite is poor (`appet_poor` = 1), their odds of having CKD are multiplied by about 1.01, all else held constant.

### 15. Despite being a relatively simple model, logistic regression is very widely used in the real world. Why do you think that's the case? Name at least two advantages to using logistic regression as a modeling technique.

Answer:
- Logistic regression allows for interpretable coefficients so that we can understand how X affects y.
- Logistic regression usually does not suffer from high variance due to the large number of simplifying assumptions placed on 
  the model. (i.e. features are "linear in the logit," errors are independent and follow a Bernoulli distribution)

### 16. Does it make sense to generate a confusion matrix on our training data or our test data? Why? Think about which data is used for model evaluation. Generate it on the proper data.

> Hint: Once you've generated your predicted $y$ values and you have your observed $y$ values, then it will be easy to [generate a confusion matrix using sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [35]:
y_prediction = logit.predict(X_test)

In [36]:
print(classification_report(y_test,y_prediction))

              precision    recall  f1-score   support

           0       0.97      1.00      0.99        35
           1       1.00      0.92      0.96        13

    accuracy                           0.98        48
   macro avg       0.99      0.96      0.97        48
weighted avg       0.98      0.98      0.98        48



In [37]:
confusion_matrix(y_test, y_prediction)

array([[35,  0],
       [ 1, 12]], dtype=int64)

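It makes sense to generate the confusion matrix on the test data, because only data the model has not seen during fitting gives a proper evaluation of the model. Here the model makes one error on the 48 test cases: a false negative (one CKD patient predicted as non-CKD).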
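The sensitivity and specificity implied by the confusion matrix above can be read off directly. The sketch below retypes the matrix by hand, so it runs without the fitted model.

```python
import numpy as np

# Confusion matrix from the test set above (rows = actual, columns = predicted).
cm = np.array([[35, 0],
               [1, 12]])

# For labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)   # 12 / 13
specificity = tn / (tn + fp)   # 35 / 35
precision = tp / (tp + fp)     # 12 / 12

print(round(sensitivity, 2), specificity, precision)
```

These values match the recall column of the classification report: 0.92 for class 1 (sensitivity) and 1.00 for class 0 (specificity).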

### 17. In this hospital case, we want to predict CKD. Do we want to optimize for sensitivity, specificity, or something else? Why? (If you don't think there's one clear answer, that's okay! There rarely is. Be sure to defend your conclusion!)

Answer:
In this hospital case, we want to optimize for sensitivity. While a false positive (Type I error) is a bad mistake to make, it is not as bad as predicting that someone is negative for CKD when they are actually positive, which is a false negative (Type II error).


### 18 (BONUS). Write a function that will create an ROC curve for you, then plot the ROC curve.

Here's a strategy you might consider:
1. In order to even begin, you'll need some fit model. Use your logistic regression model from problem 12.
2. We want to look at all values of your "threshold" - that is, anything where the predicted probability (from `.predict_proba()`) is above your threshold falls in the "positive class," and anything that is below your threshold falls in the "negative class." Start the threshold at 0.
3. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
4. Increment your threshold by some "step." Maybe set your step to be 0.01, or even smaller.
5. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
6. Repeat steps 3 and 4 until you get to the threshold of 1.
7. Plot the values of sensitivity and 1 - specificity.
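The strategy above can be sketched as follows. This is one possible implementation, not a reference solution: `roc_points` and `plot_roc` are hypothetical names, the toy labels and probabilities are made up, and the code assumes both classes are present in `y_true`.

```python
import numpy as np

def roc_points(y_true, probs, step=0.01):
    """Return (fpr, tpr) arrays for thresholds from 0 to 1 in increments of `step`."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    thresholds = np.linspace(0.0, 1.0, int(round(1 / step)) + 1)
    fpr, tpr = [], []
    for t in thresholds:
        pred = probs >= t                       # positive class at this threshold
        tp = np.sum(pred & (y_true == 1))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        fp = np.sum(pred & (y_true == 0))
        tpr.append(tp / (tp + fn))              # sensitivity
        fpr.append(fp / (fp + tn))              # 1 - specificity
    return np.array(fpr), np.array(tpr)

def plot_roc(fpr, tpr):
    """Plot sensitivity against 1 - specificity, with a chance diagonal."""
    import matplotlib.pyplot as plt             # imported here so roc_points works without it
    plt.plot(fpr, tpr, label="model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("1 - specificity")
    plt.ylabel("sensitivity")
    plt.legend()
    plt.show()

# Made-up labels and predicted probabilities for a quick check.
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_toy, p_toy)
```

With the model from problem 12, a call along the lines of `plot_roc(*roc_points(y_test, logit.predict_proba(X_test)[:, 1]))` should draw the curve for the test set.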

### 19. Suppose you're speaking with the biostatistics lead at Mayo Clinic, who asks you "Why are unbalanced classes generally a problem? Are they a problem in this particular CKD analysis?" How would you respond?

Answer:
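Unbalanced classes are generally a problem because the minority class is at risk of not having enough exposure during model fitting to be accounted for in the model: there is not enough data to learn the "pattern" or "behavior" of the minority class. In this particular analysis the imbalance is mild. The test set above has 35 non-CKD and 13 CKD cases, and the model still identifies 12 of the 13 CKD cases, so unbalanced classes do not appear to be a serious problem here.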
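A small numeric illustration may make the problem concrete. The sketch below uses hypothetical labels with a 99:1 imbalance and shows two things: accuracy can look excellent while the minority class is ignored, and the weights that `class_weight='balanced'` would assign to counter this.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 990 majority-class cases (1) and 10 minority-class cases (0).
y_true = np.array([1] * 990 + [0] * 10)
y_pred = np.ones_like(y_true)          # a model that always predicts the majority class

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=0)

# 'balanced' weights are n_samples / (n_classes * count of each class).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_true)

print(accuracy, minority_recall, weights)
```

The always-majority model scores 99% accuracy while finding none of the minority cases. The balanced weights make each minority case count far more during fitting, which is one common way to compensate.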

### 20. Suppose you're speaking with a doctor at Mayo Clinic who, despite being very smart, doesn't know much about data science or statistics. How would you explain why unbalanced classes are generally a problem to this doctor?

Answer:
Let's say you treat a patient who comes in with very common symptoms. These happen pretty frequently, so you have a good idea of how to treat them. On the other hand, assume that you are treating a patient who presents with odd symptoms that you've never seen before. You check books and do research, but it's very hard to understand this disease because you just don't have enough information to identify causes or recommend treatments. A model faces the same difficulty with a class it has rarely seen.

### 21. Let's create very unbalanced classes just for the sake of this example! Generate very unbalanced classes by [bootstrapping](http://stattrek.com/statistics/dictionary.aspx?definition=sampling_with_replacement) (a.k.a. random sampling with replacement) the majority class.

1. The majority class are those individuals with CKD.
2. Generate a random sample of size 200,000 of individuals who have CKD **with replacement**. (Consider setting a random seed for this part!). The [`pandas .sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) method may be _very_ useful here!
3. Create a new dataframe with the original data plus this random sample of data.
4. Now we should have a dataset with just over 200,000 observations, of which only about 0.075% are non-CKD individuals.

In [38]:
df_CKD_sample = df_CKD[df_CKD['class'] == 'ckd'].sample(200_000,            # sample n = 200,000
                                               replace = True,     # sample with replacement
                                               random_state = 42)  # set random state

In [39]:
df_CKD_2 = pd.concat([df_CKD, df_CKD_sample])

### 22. What do you expect will be the impact of unbalanced classes on your logistic regression model?

Answer: I expect the unbalanced classes to hurt the specificity of the model, because the model sees very few non-CKD cases compared with CKD cases.


### 23. Build a logistic regression model on the unbalanced class data and evaluate its performance using whatever method(s) you see fit. 
> Be sure to look at how well it performs on non-CKD data.

In [40]:
logit_model = df_CKD_2[['age', 'bp', 'sg', 'al', 'su', 'bgr',
                 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv',
                 'wbcc', 'rbcc','class_id']].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below

logit_model.loc[:,'rbc_abnormal_2'] = pd.get_dummies(df_CKD_2['rbc'])['abnormal']
logit_model.loc[:,'pc_abnormal_2'] = pd.get_dummies(df_CKD_2['pc'])['abnormal']
logit_model.loc[:,'pcc_present_2'] = pd.get_dummies(df_CKD_2['pcc'])['present']
logit_model.loc[:,'ba_present_2'] = pd.get_dummies(df_CKD_2['ba'])['present']
logit_model.loc[:,'htn_yes_2'] = pd.get_dummies(df_CKD_2['htn'])['yes']
logit_model.loc[:,'dm_yes_2'] = pd.get_dummies(df_CKD_2['dm'])['yes']
logit_model.loc[:,'cad_yes_2'] = pd.get_dummies(df_CKD_2['cad'])['yes']
logit_model.loc[:,'appet_poor_2'] = pd.get_dummies(df_CKD_2['appet'])['poor']
logit_model.loc[:,'pe_yes_2'] = pd.get_dummies(df_CKD_2['pe'])['yes']
logit_model.loc[:,'ane_yes_2'] = pd.get_dummies(df_CKD_2['ane'])['yes']


In [41]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(logit_model.fillna(logit_model.mean()).drop(['class_id'],
                                                                                    axis = 1),
                                                    logit_model['class_id'],
                                                    test_size = 0.3, 
                                                    random_state = 42)

In [42]:
logit_2 = LogisticRegression(C = 1,
                             random_state = 42,
                             penalty = 'l2'
                            )

In [43]:
logit_2.fit(X_train_2, y_train_2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1, random_state=42)

In [44]:
print(classification_report(y_test_2, logit_2.predict(X_test_2)))

              precision    recall  f1-score   support

           0       1.00      0.97      0.99        37
           1       1.00      1.00      1.00     60011

    accuracy                           1.00     60048
   macro avg       1.00      0.99      0.99     60048
weighted avg       1.00      1.00      1.00     60048



### 24. Do the results of your model above align with your expectations of the impact of unbalanced classes on logistic regression? If not, do you have any thoughts on why your model, considering the data, is performing how it is?

**Answer:**

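Compared with the model from problem 12, specificity decreased slightly when we built the model on the bootstrapped data (recall for class 0 fell from 1.00 to 0.97), while sensitivity increased (recall for class 1 rose from 0.92 to 1.00). The minority non-CKD class is the one that suffers, but the effect is small. One likely reason is that the bootstrapped CKD rows are exact copies of the original rows, so nearly every CKD row in the test set also appears in the training set.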
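One factor worth checking when interpreting these scores is that bootstrapping before the train/test split places copies of the same row on both sides of the split. The sketch below illustrates this on made-up data (`base`, `boot`, and `combined` are hypothetical stand-ins for `df_CKD`, `df_CKD_sample`, and `df_CKD_2`).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 distinct rows of class 1 and 5 distinct rows of class 0.
base = pd.DataFrame({"x": range(10), "label": [1] * 5 + [0] * 5})

# Bootstrap the class-1 rows, then append them to the original data.
boot = base[base["label"] == 1].sample(1_000, replace=True, random_state=42)
combined = pd.concat([base, boot])

# Split after bootstrapping, as in the cells above.
train, test = train_test_split(combined, test_size=0.3, random_state=42)

# Share of class-1 test rows whose values also occur among the class-1 training rows.
test_pos = test[test["label"] == 1]
train_pos_values = train.loc[train["label"] == 1, "x"]
seen_in_train = test_pos["x"].isin(train_pos_values).mean()
print(seen_in_train)
```

When this share is close to 1, the test score for the bootstrapped class mostly measures how well the model remembers rows it was trained on. Splitting first and bootstrapping only the training portion would avoid that overlap.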