
In [1]:
import warnings

warnings.filterwarnings('ignore')

"""
The code `warnings.filterwarnings('ignore')` is used to suppress warning messages in the code execution.

Warnings are informative messages that alert the user about potential issues or
non-optimal practices in the code. They are intended to provide guidance and suggestions
 for improving code quality or avoiding potential problems. However, sometimes these warning
messages can be excessive or unnecessary, especially in certain scenarios or when working with third-party libraries.

By using `warnings.filterwarnings('ignore')`, the code instructs the Python interpreter to
ignore any warning messages that may arise during the code execution. This ensures that the
warnings will not be displayed in the console or interrupt the program flow. It can be useful
when you are confident that the code is functioning correctly and you do not need to be alerted about the warnings.

It is important to note that suppressing warnings should be done with caution. It is generally
recommended to address and fix any issues or potential problems indicated by the warnings rather
than ignoring them outright. Ignoring warnings without proper consideration may lead to unintended
consequences or overlooking important aspects of the code.

"""

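Where a global `filterwarnings('ignore')` feels too broad, the suppression can be limited to a single block. The sketch below uses only the standard-library `warnings` module; `noisy_sum` is a hypothetical helper invented for this illustration.

```python
import warnings


def noisy_sum(values):
    """Hypothetical helper that emits a warning before returning a result."""
    warnings.warn("noisy_sum is only an illustration", UserWarning)
    return sum(values)


# Suppress warnings only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_total = noisy_sum([1, 2, 3])

# Record warnings to confirm they are still raised outside the suppressed block.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_total = noisy_sum([1, 2, 3])

print(quiet_total, loud_total, len(caught))
```

Scoping the filter this way keeps warnings from the rest of the notebook visible.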
# **Categorical variables**

## **find categorical variables**

In [None]:
# find categorical variables

categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :', categorical)

output:-

There are 7 categorical variables

The categorical variables are : ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']


"""
The code snippet is used to find and display the categorical variables in a DataFrame `df`.

Here's a breakdown of the code:

1. `categorical = [var for var in df.columns if df[var].dtype=='O']`:
   - This line creates a list comprehension that iterates over the columns of the DataFrame `df`.
   - For each column, it checks if the data type is `'O'`, which typically
     represents object or string type columns (categorical variables).
   - If the condition is met, the column name is added to the `categorical` list.

2. `print('There are {} categorical variables\n'.format(len(categorical)))`:
   - This line prints the number of categorical variables found in the DataFrame using string formatting.
   - The `len(categorical)` returns the length (number of elements) of the `categorical` list.
   - The number is inserted into the string using `{}` as a placeholder, and `.format()`
    is used to substitute the value into the string.

3. `print('The categorical variables are :', categorical)`:
   - This line prints the list of categorical variables found in the DataFrame.
   - The list `categorical` is printed as is, displaying the names of the categorical variables.

The output of this code will be the number of categorical variables found in
the DataFrame `df` and the list of their names.
"""



## **print categorical variables containing missing values**

In [None]:
# print categorical variables containing missing values

cat1 = [var for var in categorical if df[var].isnull().sum()!=0]

print(df[cat1].isnull().sum())


In [None]:
 # view frequency of categorical variables

for var in categorical:

    print(df[var].value_counts())

In [None]:
# check for cardinality in categorical variables

for var in categorical:

    print(var, ' contains ', len(df[var].unique()), ' labels')

output:

Date  contains  3436  labels
Location  contains  49  labels
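One detail of the cardinality check is that `len(df[var].unique())` counts a missing value as a label of its own. The sketch below, on a small made-up Series (not the weather data), compares it with `nunique()`, which skips missing values by default.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data only).
wind = pd.Series(["N", "S", np.nan, "N", "E"], name="WindGustDir")

labels_with_nan = len(wind.unique())          # NaN is counted as a label
labels_without_nan = wind.nunique()           # NaN is excluded by default
labels_explicit = wind.nunique(dropna=False)  # NaN is counted again

print(labels_with_nan, labels_without_nan, labels_explicit)
```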

# **Feature Engineering of Date Variable**

In [None]:
# parse the dates, currently coded as strings, into datetime format

df['Date'] = pd.to_datetime(df['Date'])



# extract year from date

df['Year'] = df['Date'].dt.year

df['Year'].head()

output:

0    2008
1    2008
2    2008
3    2008
4    2008
Name: Year, dtype: int64




# extract month from date

df['Month'] = df['Date'].dt.month

df['Month'].head()

0    12
1    12
2    12
3    12
4    12

In [None]:
# drop the original Date variable

df.drop('Date', axis=1, inplace = True)
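The date steps above can be tried end to end on a tiny frame. This is a minimal sketch with two made-up dates; the `Day` column is an assumption added for illustration and is not extracted in these notes.

```python
import pandas as pd

# Two illustrative dates stored as strings, as in the raw data.
df_demo = pd.DataFrame({"Date": ["2008-12-01", "2009-01-15"]})

# Parse the strings, then pull out the calendar parts.
df_demo["Date"] = pd.to_datetime(df_demo["Date"])
df_demo["Year"] = df_demo["Date"].dt.year
df_demo["Month"] = df_demo["Date"].dt.month
df_demo["Day"] = df_demo["Date"].dt.day  # assumption: not part of the original notes

# Drop the original column without relying on inplace=True.
df_demo = df_demo.drop(columns="Date")
print(df_demo)
```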


# **One Hot Encoding**

In [None]:
# let's do One Hot Encoding of Location variable
# get k-1 dummy variables after One Hot Encoding
# preview the dataset with head() method

pd.get_dummies(df.Location, drop_first=True).head()

"""
The code `pd.get_dummies(df.Location, drop_first=True).head()` is used to perform one-hot encoding on the
categorical variable "Location" in the DataFrame `df`.

Here's a breakdown of the code:

1. `pd.get_dummies(df.Location, drop_first=True)`:
   - `pd.get_dummies()` is a pandas function that converts categorical variables into dummy/indicator variables.
   - `df.Location` selects the "Location" column from the DataFrame `df` to be one-hot encoded.
   - `drop_first=True` is an optional parameter that specifies whether to drop the first category
      in each variable. This is done to prevent multicollinearity issues when using
      the one-hot encoded variables in a regression model.

2. `.head()`:
   - The `.head()` method is called on the resulting DataFrame to display the first few rows.

The output of this code will be a DataFrame that contains the one-hot encoded representation
of the "Location" variable, where each category in the "Location" column except the first is
represented by a separate binary column. The `drop_first=True` parameter ensures that
only (n-1) dummy columns are created for n categories.
"""

In [None]:
# sum the number of 1s per boolean variable over the rows of the dataset
# it will tell us how many observations we have for each category

pd.get_dummies(df.WindGustDir, drop_first=True, dummy_na=True).sum(axis=0)


'''
axis=0 specifies that the sum should be calculated vertically, i.e., for each category column.
The output of this code will be a Series that shows the sum of the number of occurrences (1s)
per category in the "WindGustDir" column. Each category will be represented as an index label in the Series,
and the corresponding value will be the sum of occurrences for that category.

'''
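To see how `drop_first` and `dummy_na` interact, here is a small sketch on a made-up wind-direction column. With three categories plus a missing value, the first category ("N") is dropped and an extra NaN indicator column is added.

```python
import numpy as np
import pandas as pd

# Toy column: three categories and one missing value (illustrative data only).
wind = pd.Series(["N", "S", np.nan, "N", "S", "W"], name="WindGustDir")

# drop_first removes the first category ("N"); dummy_na adds a NaN indicator column.
dummies = pd.get_dummies(wind, drop_first=True, dummy_na=True)

# Column-wise sum gives the number of observations per remaining category.
counts = dummies.sum(axis=0)
print(counts)
```

The rows that belong to the dropped category are all zeros in the remaining columns, which is why they do not appear in the counts.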


# **Explore Numerical Variables**

In [None]:
# find numerical variables

numerical = [var for var in df.columns if df[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :', numerical)



output:-

There are 19 numerical variables
'''
The code snippet is used to find and display the numerical variables in a DataFrame `df`.

Here's a breakdown of the code:

1. `numerical = [var for var in df.columns if df[var].dtype!='O']`:
   - This line creates a list comprehension that iterates over the columns of the DataFrame `df`.
   - For each column, it checks if the data type is not equal to `'O'`,
    indicating that it is not an object or string type column (numerical variable).
   - If the condition is met, the column name is added to the `numerical` list.

2. `print('There are {} numerical variables\n'.format(len(numerical)))`:
   - This line prints the number of numerical variables found in the DataFrame using string formatting.
   - The `len(numerical)` returns the length (number of elements) of the `numerical` list.
   - The number is inserted into the string using `{}` as a placeholder, and `.format()`
    is used to substitute the value into the string.

3. `print('The numerical variables are :', numerical)`:
   - This line prints the list of numerical variables found in the DataFrame.
   - The list `numerical` is printed as is, displaying the names of the numerical variables.

The output of this code will be the number of numerical variables found in the DataFrame `df` and the list of their names.
'''


In [None]:
# view summary statistics in numerical variables

# the number of decimals must be passed to round(), not to print()
print(round(df[numerical].describe(), 2))


'''
The code `print(round(df[numerical].describe(), 2))` is used to view summary
statistics of the numerical variables in the DataFrame `df`.

Here's a breakdown of the code:

1. `df[numerical].describe()`:
   - This line selects the columns corresponding to the numerical variables in
   the DataFrame `df` using `df[numerical]`.
   - `.describe()` is a method that computes various summary statistics of the
   selected numerical columns, including count, mean, standard deviation, minimum value,
    25th percentile, median (50th percentile), 75th percentile, and maximum value.

2. `round(..., 2)`:
   - The `round()` function is used to round the values in the summary statistics to two decimal places.
   - It takes the result of `df[numerical].describe()` as input and specifies the number of decimal places to round to.

3. `print(...)`:
   - This line prints the rounded summary statistics of the numerical variables.

The output of this code will be the summary statistics of the numerical variables in
 the DataFrame `df`, including count, mean, standard deviation, minimum value, quartiles,
  and maximum value, rounded to two decimal places.

'''


In [None]:
plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.boxplot(column='Rainfall')
fig.set_title('')
fig.set_ylabel('Rainfall')

'''
The code snippet is used to create a boxplot of the "Rainfall" variable in the DataFrame `df` using Matplotlib.

Here's a breakdown of the code:

1. `plt.figure(figsize=(15,10))`:
   - This line creates a new figure with a specified figure size of 15 inches (width) and 10 inches (height).
   - The `figsize` parameter is set using a tuple `(15, 10)`.

2. `plt.subplot(2, 2, 1)`:
   - This line creates a subplot in a 2x2 grid and selects the first subplot (top-left).
   - The `subplot` function takes three arguments: the number of rows, the number of columns,
     and the index of the subplot to select.
   - In this case, a 2x2 grid is created, and the first subplot is selected with an index of 1.

3. `fig = df.boxplot(column='Rainfall')`:
   - This line creates a boxplot of the "Rainfall" variable in the DataFrame `df`.
   - The `column` parameter is set to `'Rainfall'` to specify the column to use for the boxplot.
   - The resulting boxplot is assigned to the variable `fig`.

4. `fig.set_title('')`:
   - This line sets an empty title for the boxplot.
   - The `set_title()` method is called on the `fig` object.

5. `fig.set_ylabel('Rainfall')`:
   - This line sets the label for the y-axis of the boxplot to `'Rainfall'`.
   - The `set_ylabel()` method is called on the `fig` object.

The output of this code will be a boxplot of the "Rainfall" variable in the DataFrame `df`,
displayed in the first subplot of a 2x2 grid. The y-axis will be labeled as "Rainfall".
'''

In [None]:
plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.Rainfall.hist(bins=10)
fig.set_xlabel('Rainfall')
# the y-axis of a histogram shows the number of observations per bin
fig.set_ylabel('Frequency')


In [None]:
# find outliers for Rainfall variable

IQR = df.Rainfall.quantile(0.75) - df.Rainfall.quantile(0.25)
Lower_fence = df.Rainfall.quantile(0.25) - (IQR * 3)
Upper_fence = df.Rainfall.quantile(0.75) + (IQR * 3)
print('Rainfall outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))

Rainfall outliers are values < -2.4000000000000004 or > 3.2



'''
The code snippet is used to find outliers for the "Rainfall" variable in the DataFrame `df`.

Here's a breakdown of the code:

1. `IQR = df.Rainfall.quantile(0.75) - df.Rainfall.quantile(0.25)`:
   - This line calculates the interquartile range (IQR) of the "Rainfall" variable.
   - The `.quantile()` method is used to calculate the 25th percentile (Q1) and
   the 75th percentile (Q3), and the difference between them gives the IQR.

2. `Lower_fence = df.Rainfall.quantile(0.25) - (IQR * 3)`:
   - This line calculates the lower fence for outliers.
   - The lower fence is calculated as Q1 - (IQR * 3), where Q1 is the 25th percentile.

3. `Upper_fence = df.Rainfall.quantile(0.75) + (IQR * 3)`:
   - This line calculates the upper fence for outliers.
   - The upper fence is calculated as Q3 + (IQR * 3), where Q3 is the 75th percentile.

4. `print('Rainfall outliers are values < {lowerboundary} or > {upperboundary}'
    .format(lowerboundary=Lower_fence, upperboundary=Upper_fence))`:
   - This line prints the range of values that are considered outliers for the "Rainfall" variable.
   - The placeholders `{lowerboundary}` and `{upperboundary}` are replaced with the values of
     `Lower_fence` and `Upper_fence`, respectively.
   - This provides the lower and upper boundaries for outliers based on the calculated fences.

The output of this code will be a message indicating the range of values that are considered outliers
for the "Rainfall" variable. In this case, it states that outliers are values less than -2.4 or greater than 3.2.
Because the fences use 3 * IQR rather than the more common 1.5 * IQR, only extreme outliers are flagged.
'''
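The fence calculation can be wrapped in a small reusable function. This is a sketch on made-up numbers; `iqr_fences` and its `k` parameter are names introduced here for illustration.

```python
import pandas as pd


def iqr_fences(series, k=3.0):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data: eight ordinary values and one extreme value.
rain = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 100], dtype="float64")

lower, upper = iqr_fences(rain, k=3.0)
outliers = rain[(rain < lower) | (rain > upper)]
print(lower, upper, outliers.tolist())
```

Here Q1 is 2 and Q3 is 6, so the IQR is 4 and the fences sit at -10 and 18; only the value 100 falls outside them.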

**drop target column**

In [None]:
X = df.drop(['RainTomorrow'], axis=1)

y = df['RainTomorrow']

'''
After executing this code, X will contain the independent variable(s) (all columns except 'RainTomorrow')
and y will contain the dependent variable ('RainTomorrow').
'''
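The cells that follow refer to `X_train`, `X_test`, `y_train` and `y_test`, but the split itself is not shown in these notes. A minimal sketch of one common way to create them is given below; the toy data, `test_size=0.2` and `random_state=0` are assumptions, not values taken from the notes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the weather data (illustrative values only).
df_demo = pd.DataFrame({
    "Rainfall": [0.0, 1.2, 0.0, 3.4, 0.0, 0.6, 0.0, 2.2, 0.0, 0.4],
    "RainTomorrow": ["No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

X = df_demo.drop(["RainTomorrow"], axis=1)
y = df_demo["RainTomorrow"]

# Assumed split: 80% training rows, 20% test rows, fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```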



**percentage of missing values in the numerical variables in training set**

In [None]:
# print percentage of missing values in the numerical variables in training set

for col in numerical:
    if X_train[col].isnull().mean()>0:
        print(col, round(X_train[col].isnull().mean(),4))
'''
The code is used to identify numerical variables in the `X_train` DataFrame that have missing values
and print the variable names along with the proportion of missing values.

Here's how it works:

1. `for col in numerical:`: This line iterates over each variable name in the `numerical` list,
    which contains the names of the numerical variables in the `X_train` DataFrame.

2. `if X_train[col].isnull().mean()>0:`: This line checks if the selected variable (`col`)
    in the `X_train` DataFrame
    has missing values. It does this by calculating the mean of the boolean mask `X_train[col].isnull()`,
    which indicates whether each value in the variable is missing or not. If the mean is greater than 0,
    it means there are missing values in that variable.

3. `print(col, round(X_train[col].isnull().mean(),4))`: If there are missing values in the variable,
    this line prints the variable name (`col`) and the proportion of missing values for that variable.
    The `round()` function is used to round the proportion to four decimal places.

By running this code, you will get a list of numerical variables in the `X_train` DataFrame that have missing values,
along with the proportion of missing values for each variable.
'''

In [None]:
# impute missing values in X_train and X_test with respective column median in X_train

for df1 in [X_train, X_test]:
    for col in numerical:
        col_median=X_train[col].median()
        df1[col].fillna(col_median, inplace=True)

'''
The code snippet is used to impute missing values in the `X_train` and `X_test` DataFrames
using the respective column medians from the `X_train` DataFrame.

Here's a breakdown of the code:

1. `for df1 in [X_train, X_test]:`:
   - This line iterates over the list `[X_train, X_test]`, which contains the `X_train` and `X_test` DataFrames.

2. `for col in numerical:`:
   - This line iterates over each variable name in the `numerical` list, which
     contains the names of the numerical variables.

3. `col_median = X_train[col].median()`:
   - This line calculates the median of the selected variable (`col`) in the `X_train` DataFrame.
   - The `.median()` method is used to calculate the median value.

4. `df1[col].fillna(col_median, inplace=True)`:
   - This line fills the missing values in the selected variable (`col`) of the current DataFrame (`df1`)
     with the corresponding column median from the `X_train` DataFrame.
   - The `.fillna()` method is used to fill missing values, and `col_median` is used as the value
     to fill the missing values with.
   - The `inplace=True` argument ensures that the changes are applied directly to the DataFrame.

By executing this code, the missing values in the numerical variables of both `X_train` and `X_test`
will be imputed with the respective column medians from the `X_train` DataFrame.
'''
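Recent pandas releases may warn when `fillna(..., inplace=True)` is called on a column selected from a DataFrame, because the selection can be a temporary copy. The sketch below shows an assignment-based variant of the same idea on made-up data; the medians still come from the training set only.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with missing values (illustrative data only).
X_train = pd.DataFrame({"Rainfall": [0.0, 2.0, np.nan, 4.0]})
X_test = pd.DataFrame({"Rainfall": [np.nan, 1.0]})
numerical = ["Rainfall"]

# Compute the medians once, on the training set only.
medians = X_train[numerical].median()

# Assign the filled columns back instead of filling in place.
X_train[numerical] = X_train[numerical].fillna(medians)
X_test[numerical] = X_test[numerical].fillna(medians)

print(X_train["Rainfall"].tolist(), X_test["Rainfall"].tolist())
```

Using the training median for both frames keeps information from the test set out of the preprocessing step.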

## **Engineering missing values in categorical variables**

In [None]:
# print percentage of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()

Location       0.000000
WindGustDir    0.065114

dtype: float64




# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))

WindGustDir 0.06511419378659213
WindDir9am 0.07013379749283542


# impute missing categorical variables with most frequent value

for df2 in [X_train, X_test]:
    df2['WindGustDir'].fillna(X_train['WindGustDir'].mode()[0], inplace=True)
    df2['WindDir9am'].fillna(X_train['WindDir9am'].mode()[0], inplace=True)




# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()



# As a final check, we will check for missing values in X_train and X_test.

# check missing values in X_train

X_train.isnull().sum()



## **Engineering outliers in numerical variables**


In [None]:
def max_value(df3, variable, top):
    return np.where(df3[variable]>top, top, df3[variable])

for df3 in [X_train, X_test]:
    df3['Rainfall'] = max_value(df3, 'Rainfall', 3.2)
    df3['Evaporation'] = max_value(df3, 'Evaporation', 21.8)

'''
The code snippet defines a function `max_value` and uses it to cap the values in the
'Rainfall' and 'Evaporation' variables in both the `X_train` and `X_test` DataFrames.

Here's a breakdown of the code:

1. `def max_value(df3, variable, top):`:
   - This line defines a function named `max_value` that takes three arguments: `df3` (DataFrame),
   `variable` (name of the variable), and `top` (maximum value to cap).

2. `return np.where(df3[variable]>top, top, df3[variable])`:
   - This line uses `np.where` to compare the values in the specified variable (`variable`)
     of the DataFrame (`df3`) with the given `top` value.
   - If a value in the variable is greater than `top`, it is replaced with `top`; otherwise,
     the original value is kept.
   - The function returns the resulting array of capped values.

3. `for df3 in [X_train, X_test]:`:
   - This line iterates over the list `[X_train, X_test]`, which contains
     the `X_train` and `X_test` DataFrames.

4. `df3['Rainfall'] = max_value(df3, 'Rainfall', 3.2)`:
   - This line calls the `max_value` function to cap the values in the 'Rainfall' variable
     of the current DataFrame (`df3`) at 3.2.
   - The capped values are assigned back to the 'Rainfall' variable in the DataFrame.

5. `df3['Evaporation'] = max_value(df3, 'Evaporation', 21.8)`:
   - This line calls the `max_value` function to cap the values in the 'Evaporation'
     variable of the current DataFrame (`df3`) at 21.8.
   - The capped values are assigned back to the 'Evaporation' variable in the DataFrame.

By executing this code, the values in the 'Rainfall' and 'Evaporation' variables of
both `X_train` and `X_test` will be capped at the specified maximum values.
Any value greater than the maximum will be replaced with the maximum value.
'''
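The same capping can also be written with pandas' built-in `Series.clip`. The sketch below compares the `np.where` approach with `clip` on made-up values; both leave missing values untouched.

```python
import numpy as np
import pandas as pd

# Toy column with one value above the cap and one missing value.
X_demo = pd.DataFrame({"Rainfall": [0.0, 1.5, 3.2, 10.0, np.nan]})

# Approach used in the notes: replace values above the cap with the cap.
capped_where = np.where(X_demo["Rainfall"] > 3.2, 3.2, X_demo["Rainfall"])

# Equivalent built-in: clip only the upper end of the range.
capped_clip = X_demo["Rainfall"].clip(upper=3.2)

print(capped_clip.tolist())
```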

In [None]:
pip install category-encoders

# encode RainToday variable

import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['RainToday'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)



'''
The code snippet is used to encode the 'RainToday' variable using binary encoding.
Here's a breakdown of the code:

1. `import category_encoders as ce`:
   - This line imports the `category_encoders` library, which provides various
     encoding techniques for categorical variables.

2. `encoder = ce.BinaryEncoder(cols=['RainToday'])`:
   - This line creates an instance of the `BinaryEncoder` class from `category_encoders`.
   - The `BinaryEncoder` is a type of categorical encoding that converts categorical
     variables into binary representations.

3. `X_train = encoder.fit_transform(X_train)`:
   - This line applies the binary encoding transformation to
     the 'RainToday' variable in the `X_train` DataFrame.
   - The `.fit_transform()` method is used to fit the encoder on the training data (`X_train`) and
     transform it into the encoded representation.
   - The transformed data is assigned back to the `X_train` DataFrame, overwriting
     the original 'RainToday' variable with the encoded binary columns.

4. `X_test = encoder.transform(X_test)`:
   - This line applies the same binary encoding transformation
     to the 'RainToday' variable in the `X_test` DataFrame.
   - The `.transform()` method is used to transform the test data (`X_test`) using the fitted encoder.
   - The transformed data is assigned back to the `X_test` DataFrame, replacing
     the original 'RainToday' variable with the encoded binary columns.

By executing this code, the 'RainToday' variable in both `X_train` and `X_test` will be
encoded using binary encoding. Each category is first mapped to an integer, and that integer is then
written in binary, with one column per binary digit (typically named `RainToday_0`, `RainToday_1`, ...).
Unlike one-hot encoding, a single column does not correspond to a single category.

'''
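To make the idea of binary encoding concrete, here is a hand-rolled sketch in plain pandas. It is an illustration of the technique only, not the implementation used by `category_encoders`; the helper `binary_encode`, the integer codes it assigns, and the column order are assumptions and may differ from what the library produces.

```python
import pandas as pd


def binary_encode(values, prefix):
    """Illustrative binary encoding: integer code per category, one column per bit."""
    codes, uniques = pd.factorize(values)  # 0..k-1 by first appearance, -1 if missing
    codes = codes + 1                      # shift so that 0 is left for missing values
    n_bits = max(1, len(uniques).bit_length())
    columns = {}
    for position in range(n_bits):
        shift = n_bits - 1 - position      # most significant bit first
        columns[f"{prefix}_{position}"] = (codes >> shift) & 1
    return pd.DataFrame(columns)


# Five categories need only three columns, versus five with one-hot encoding.
wind = pd.Series(["N", "E", "S", "W", "NE"])
encoded = binary_encode(wind, "WindGustDir")
print(encoded)
```

The five categories receive the codes 1 to 5, written as 001, 010, 011, 100 and 101 across the three columns.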

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)


'''
The code snippet is used to apply min-max scaling to the features in
the `X_train` and `X_test` datasets. Here's a breakdown of the code:

1. `from sklearn.preprocessing import MinMaxScaler`:
   - This line imports the `MinMaxScaler` class from the `sklearn.preprocessing` module,
     which is used for min-max scaling.

2. `scaler = MinMaxScaler()`:
   - This line creates an instance of the `MinMaxScaler` class,
     which will be used to perform the scaling.

3. `X_train = scaler.fit_transform(X_train)`:
   - This line applies the min-max scaling transformation to the features in the `X_train` dataset.
   - The `.fit_transform()` method is used to fit the scaler on
     the training data (`X_train`) and transform it.
   - The transformed data is assigned back to the `X_train` dataset,
     overwriting the original feature values with the scaled values.

4. `X_test = scaler.transform(X_test)`:
   - This line applies the same scaling transformation to the features in the `X_test` dataset.
   - The `.transform()` method is used to transform the test data (`X_test`) using the fitted scaler.
   - The transformed data is assigned back to the `X_test` dataset,
     replacing the original feature values with the scaled values.

By executing this code, the features in both `X_train` and `X_test` will be scaled
 using min-max scaling. Min-max scaling transforms the values of each feature to a range between 0 and 1, based
 on the minimum and maximum values of the feature in the training data. This scaling ensures
 that all features have a similar scale and prevents any particular feature
 from dominating the learning algorithm based on its magnitude.

'''
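Two practical points about this step are that `fit_transform` returns a NumPy array (so column names are lost) and that test values outside the training range land outside [0, 1]. The sketch below, on made-up data, wraps the result back into a DataFrame and shows the second effect.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames: the test set contains a value above the training maximum.
X_train = pd.DataFrame({"a": [0.0, 5.0, 10.0]})
X_test = pd.DataFrame({"a": [5.0, 20.0]})

scaler = MinMaxScaler()

# Rebuild DataFrames so the column names and index survive the scaling step.
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled["a"].tolist(), X_test_scaled["a"].tolist())
```

The test value 20 scales to 2.0 because the scaler only knows the training minimum (0) and maximum (10).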

In [None]:
# probability of getting output as 0 - no rain

logreg.predict_proba(X_test)[:,0]


# probability of getting output as 1 - rain

logreg.predict_proba(X_test)[:,1]


from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred_test)))

Model accuracy score: 0.8502
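The cells above use `logreg` and `y_pred_test`, which are not defined anywhere in these notes. The sketch below shows one way such objects could be produced; it runs on synthetic data, and the `_demo` names, the `max_iter=1000` setting and the split parameters are all assumptions rather than the settings behind the reported 0.8502.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the weather features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the model on the training portion only.
logreg_demo = LogisticRegression(max_iter=1000)
logreg_demo.fit(Xd_train, yd_train)

# Hard class predictions and class probabilities for the test portion.
yd_pred_test = logreg_demo.predict(Xd_test)
proba = logreg_demo.predict_proba(Xd_test)

demo_accuracy = accuracy_score(yd_test, yd_pred_test)
print('Model accuracy score: {0:0.4f}'.format(demo_accuracy))
```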


## **Compare the train-set and test-set accuracy**

In [None]:
y_pred_train = logreg.predict(X_train)

y_pred_train

print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

Training-set accuracy score: 0.8476

# **Check for overfitting and underfitting**

In [None]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(logreg.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(logreg.score(X_test, y_test)))

Training set score: 0.8476
Test set score: 0.8502


In [None]:

# Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_test)

print('Confusion matrix\n\n', cm)

# scikit-learn puts actual classes on the rows and predicted classes on the columns,
# with labels in sorted order ('No' = 0 first, 'Yes' = 1 second).
# With 'Yes' as the positive class: cm[1,1] = TP, cm[0,0] = TN, cm[0,1] = FP, cm[1,0] = FN

print('\nTrue Positives(TP) = ', cm[1,1])

print('\nTrue Negatives(TN) = ', cm[0,0])

print('\nFalse Positives(FP) = ', cm[0,1])

print('\nFalse Negatives(FN) = ', cm[1,0])

Confusion matrix

 [[20892  1175]
 [ 3086  3286]]

True Positives(TP) =  3286

True Negatives(TN) =  20892

False Positives(FP) =  1175

False Negatives(FN) =  3086


The confusion matrix shows 20892 + 3286 = 24178 correct predictions and
 3086 + 1175 = 4261 incorrect predictions.

In this case, we have

True Positives (Actual Positive:1 and Predict Positive:1) - 3286

True Negatives (Actual Negative:0 and Predict Negative:0) - 20892

False Positives (Actual Negative:0 but Predict Positive:1) - 1175 (Type I error)

False Negatives (Actual Positive:1 but Predict Negative:0) - 3086 (Type II error)


'''
A confusion matrix is a table that is used to evaluate the performance of a classification model.
It summarizes the predictions made by the model on a test dataset and compares them to the actual class labels.
The confusion matrix provides insights into the accuracy of the model by breaking
down the predictions into four categories: true positives (TP), true negatives (TN),
false positives (FP), and false negatives (FN).

scikit-learn places the actual classes on the rows and the predicted classes on the columns,
with the labels in sorted order. Here row/column 0 is "No" (negative) and row/column 1 is "Yes" (positive).

Here's a breakdown of the code and the information provided by the confusion matrix:

1. `from sklearn.metrics import confusion_matrix`:
   - This line imports the `confusion_matrix` function from the `sklearn.metrics` module,
     which is used to compute the confusion matrix.

2. `cm = confusion_matrix(y_test, y_pred_test)`:
   - This line calculates the confusion matrix by comparing the true class
     labels (`y_test`) with the predicted class labels (`y_pred_test`).

3. `print('Confusion matrix\n\n', cm)`:
   - This line prints the confusion matrix, which is a table showing the counts
     of true negatives, false positives, false negatives, and true positives.

4. `print('\nTrue Positives(TP) = ', cm[1,1])`:
   - This line prints the count of true positives (TP), which represents the number
     of actual "Yes" observations that were correctly predicted as "Yes".

5. `print('\nTrue Negatives(TN) = ', cm[0,0])`:
   - This line prints the count of true negatives (TN), which represents the number
     of actual "No" observations that were correctly predicted as "No".

6. `print('\nFalse Positives(FP) = ', cm[0,1])`:
   - This line prints the count of false positives (FP), which represents the number
     of actual "No" observations that were incorrectly predicted as "Yes".

7. `print('\nFalse Negatives(FN) = ', cm[1,0])`:
   - This line prints the count of false negatives (FN), which represents the number
     of actual "Yes" observations that were incorrectly predicted as "No".

Summary:
The confusion matrix provided by the code has the following information:
- True Positives (TP): 3,286
- True Negatives (TN): 20,892
- False Positives (FP): 1,175
- False Negatives (FN): 3,086

The confusion matrix allows us to assess the performance of a classification model by providing a
comprehensive view of its predictive accuracy. The counts in each category help us understand how well
the model is correctly classifying positive and negative instances.
'''
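For a binary problem, the four cells can be unpacked in one step with `ravel()`, which returns them in the order TN, FP, FN, TP (rows are actual classes, columns are predicted classes). The sketch below uses a handful of made-up labels rather than the weather predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = negative class, 1 = positive class (illustrative data only).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

cm_demo = confusion_matrix(y_true, y_pred)

# Row-major flattening of [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(tn, fp, fn, tp)
```

Unpacking by name avoids having to remember which index pair belongs to which cell.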


In [None]:
# visualize confusion matrix with seaborn heatmap

# rows of cm are the actual classes and columns are the predicted classes,
# both ordered as 0 ('No') then 1 ('Yes')
cm_matrix = pd.DataFrame(data=cm, index=['Actual Negative:0', 'Actual Positive:1'],
                                 columns=['Predict Negative:0', 'Predict Positive:1'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')


'''
This code visualizes the confusion matrix using a heatmap
with the help of the Seaborn library.

Here's a breakdown of the code:

1. `cm_matrix = pd.DataFrame(data=cm, index=['Actual Negative:0', 'Actual Positive:1'],
     columns=['Predict Negative:0', 'Predict Positive:1'])`:
   - This line creates a Pandas DataFrame called `cm_matrix` using the values
     from the confusion matrix `cm`.
   - It specifies the index names as "Actual Negative:0" and "Actual Positive:1",
     because the rows of a scikit-learn confusion matrix are the actual class labels.
   - It specifies the column names as "Predict Negative:0" and "Predict Positive:1",
     because the columns are the predicted class labels.

2. `sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')`:
   - This line uses the Seaborn `heatmap` function to create a heatmap of the `cm_matrix` DataFrame.
   - The `annot=True` parameter adds the values from the confusion matrix to each cell of the heatmap.
   - The `fmt='d'` parameter specifies that the values should be displayed as integers.
   - The `cmap='YlGnBu'` parameter sets the color scheme of the heatmap.

 The resulting heatmap visualizes the confusion matrix, where the rows represent the actual
 class labels and the columns represent the predicted class labels. The color intensity of each cell
 indicates the count of observations falling into that category. The annotated values inside the cells
 show the exact counts.

By visualizing the confusion matrix, you can get a clearer understanding of the distribution
of predictions and evaluate the performance of the classification model.

'''

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

          No       0.87      0.95      0.91     22067
         Yes       0.74      0.52      0.61      6372

    accuracy                           0.85     28439
   macro avg       0.80      0.73      0.76     28439
weighted avg       0.84      0.85      0.84     28439



'''
The classification report provides various performance metrics for each class in a classification problem.
 Here's an explanation of the metrics in the classification report and their interpretation:

- Precision: Precision measures the accuracy of positive predictions. It is calculated as
   the ratio of true positive predictions
   to the total predicted positives. In the classification report, precision is reported for each class.
   In this case, the precision for the "No" class is 0.87, and for the "Yes" class is 0.74. Higher precision
   values indicate a lower rate of false positive predictions.

- Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of actual positives
 that are correctly identified. It is calculated as the ratio of true positive predictions to
 the total actual positives. In the classification report, recall is reported for each class. In this case,
 the recall for the "No" class is 0.95, and for the "Yes" class is 0.52. Higher recall values indicate
 a lower rate of false negative predictions.

- F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of
 the model's accuracy by considering both precision and recall. The F1-score is reported for each class
 in the classification report. In this case, the F1-score for the "No" class is 0.91, and for the "Yes" class is 0.61.
  The F1-score ranges from 0 to 1, with a higher value indicating better performance.

- Support: Support refers to the number of samples in each class. It provides an indication of
the imbalance in the dataset.
In the classification report, support is reported for each class. In this case, there are 22,067 samples in
the "No" class and 6,372 samples in the "Yes" class.

- Accuracy: Accuracy measures the overall correctness of the model's predictions.
It is calculated as the ratio of correct predictions to the total number of predictions.
In this case, the accuracy is reported as 0.85, indicating that the model predicts
 the correct class for 85% of the samples.

- Macro Avg: The macro average calculates the average metric across all classes,
giving equal weight to each class. In the classification report, the macro average precision, recall,
and F1-score are reported. In this case, the macro average precision is 0.80,
the recall is 0.73, and the F1-score is 0.76.

- Weighted Avg: The weighted average calculates the average metric across all classes,
 weighted by the number of samples in each class. In the classification report,
 the weighted average precision, recall, and F1-score are reported. In this case,
 the weighted average precision is 0.84, the recall is 0.85, and the F1-score is 0.84.

In general, higher precision, recall, and F1-score values indicate better model performance.
 However, the interpretation of what is considered a good or bad score depends on the specific context
 and requirements of the classification problem. It is important to consider the trade-off between precision
 and recall based on the specific needs of the application.
'''
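The per-class figures in the report can be reproduced by hand from the confusion matrix printed earlier. The sketch below hard-codes that matrix and assumes the scikit-learn layout (rows are actual classes, columns are predicted classes, "No" before "Yes"); the variable names are introduced here for illustration.

```python
import numpy as np

# Confusion matrix from the notes: rows = actual (No, Yes), columns = predicted (No, Yes).
cm = np.array([[20892, 1175],
               [3086, 3286]])
tn, fp, fn, tp = cm.ravel()

# Metrics for the "Yes" class.
precision_yes = tp / (tp + fp)
recall_yes = tp / (tp + fn)
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Metrics for the "No" class: swap the roles of the two classes.
precision_no = tn / (tn + fn)
recall_no = tn / (tn + fp)

accuracy = (tp + tn) / cm.sum()

print(round(precision_yes, 2), round(recall_yes, 2), round(f1_yes, 2))
print(round(precision_no, 2), round(recall_no, 2), round(accuracy, 2))
```

The rounded values agree with the "Yes" row (0.74, 0.52, 0.61), the "No" row (0.87, 0.95) and the accuracy (0.85) of the report.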

In [None]:
# rows of cm are actual classes and columns are predicted classes (0 = 'No', 1 = 'Yes'),
# so with 'Yes' as the positive class:
TP = cm[1,1]
TN = cm[0,0]
FP = cm[0,1]
FN = cm[1,0]


# print classification accuracy

classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)

print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))

Classification accuracy : 0.8502



# print classification error

classification_error = (FP + FN) / float(TP + TN + FP + FN)

print('Classification error : {0:0.4f}'.format(classification_error))

Classification error : 0.1498


# recall score

recall = TP / float(TP + FN)

print('Recall or Sensitivity : {0:0.4f}'.format(recall))

Recall or Sensitivity : 0.5157




#True Positive Rate
#True Positive Rate is synonymous with Recall.

true_positive_rate = TP / float(TP + FN)

print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))

True Positive Rate : 0.5157


#False Positive Rate

false_positive_rate = FP / float(FP + TN)


print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))

False Positive Rate : 0.0532


# Specificity

specificity = TN / (TN + FP)

print('Specificity : {0:0.4f}'.format(specificity))

Specificity : 0.9468



In [None]:
# Adjusting the threshold level

# print the first 10 predicted probabilities of two classes- 0 and 1

y_pred_prob = logreg.predict_proba(X_test)[0:10]

y_pred_prob

array([[0.91382428, 0.08617572],
       [0.83565645, 0.16434355]])
'''
The code snippet is calculating and printing the predicted probabilities for the first 10 samples
 in the test set using the logistic regression model (logreg). The predicted probabilities represent the model's
 estimated probability for each sample belonging to each class (0 and 1).

The output is an array of shape (10, 2), of which only the first two rows are reproduced above. Each row holds the predicted probabilities for one sample.
 The first column represents the probability of belonging to class 0, and
 the second column represents the probability of belonging to class 1.

For example, the first row [0.91382428, 0.08617572] indicates that the model predicts
a high probability (0.91382428) for the sample to belong to class 0 and a low probability (0.08617572)
for it to belong to class 1. Similarly, the second row [0.83565645, 0.16434355] shows
the predicted probabilities for another sample.
'''
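Adjusting the threshold means turning these probabilities into class labels with a cut-off other than the default 0.5. The sketch below reuses the two probability rows shown above; the helper `classify` and the threshold value 0.1 are assumptions chosen only to illustrate the effect.

```python
import numpy as np

# The two probability rows reproduced in the notes: [P(class 0), P(class 1)].
y_pred_prob = np.array([[0.91382428, 0.08617572],
                        [0.83565645, 0.16434355]])


def classify(prob_rain, threshold=0.5):
    """Label a sample 1 (rain) when its class-1 probability reaches the threshold."""
    return (prob_rain >= threshold).astype(int)


default_labels = classify(y_pred_prob[:, 1])                  # threshold 0.5
lenient_labels = classify(y_pred_prob[:, 1], threshold=0.1)   # hypothetical lower threshold

print(default_labels.tolist(), lenient_labels.tolist())
```

Lowering the threshold flips the second sample to class 1, which trades more false positives for fewer false negatives.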