### What is Logistic Regression?

Logistic regression is a statistical method used for binary classification. It predicts the probability that a given input belongs to a particular category (e.g., success/failure, yes/no) using a logistic function.
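
To make the logistic function concrete, here is a minimal sketch in plain Python. The names `sigmoid` and `predict_proba`, and the weights and bias, are illustrative assumptions and not part of this notebook's model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability in the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation.

    The linear part (bias + sum of weight * feature) is the log-odds;
    the logistic function turns it into a probability.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights chosen so the log-odds are exactly 0
p = predict_proba([2.0, 1.0], weights=[0.5, -1.0], bias=0.0)
print(p)  # log-odds of 0 correspond to a probability of 0.5
```

A probability above 0.5 is usually mapped to class 1 and anything below to class 0, although that threshold is a choice and can be moved.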

### Types of Logistic Regression

1. **Binary Logistic Regression**: Used when the dependent variable has two possible outcomes.
2. **Multinomial Logistic Regression**: Used when the dependent variable has more than two unordered categories.
3. **Ordinal Logistic Regression**: Used when the dependent variable has more than two ordered categories.

### Assumptions of Logistic Regression

1. **Binary Outcome**: The dependent variable must be binary (for binary logistic regression).
2. **Independence of Observations**: The observations should be independent of each other.
3. **Linearity**: There is a linear relationship between the log-odds of the dependent variable and the independent variables.
4. **No Multicollinearity**: Independent variables should not be highly correlated with each other.
5. **Large Sample Size**: A larger sample size is preferred for more reliable results.
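
One simple way to screen for the multicollinearity assumption is to look at pairwise correlations between the features. The sketch below uses a small made-up table (`x1`, `x2`, `x3` are hypothetical columns) and an arbitrary cutoff of 0.9. It is only a rough check and is not what this notebook does later.

```python
import pandas as pd

# Hypothetical features: x2 is roughly 2 * x1, x3 is unrelated
features = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = features.corr().abs()  # absolute Pearson correlations
threshold = 0.9               # assumed cutoff, tune as needed

# Collect each pair of distinct columns whose correlation exceeds the cutoff
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

When a pair is flagged, dropping one of the two columns is a common remedy.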


In [45]:
# importing library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, recall_score, precision_score, f1_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

#### Load the Titanic dataset for practice.

In [46]:
df=sns.load_dataset('titanic')

In [47]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


#### Preprocessing Data

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [49]:
round((df.isnull().sum()*100)/len(df),2)

survived        0.00
pclass          0.00
sex             0.00
age            19.87
sibsp           0.00
parch           0.00
fare            0.00
embarked        0.22
class           0.00
who             0.00
adult_male      0.00
deck           77.22
embark_town     0.22
alive           0.00
alone           0.00
dtype: float64

##### Here we see that [age, embarked, deck, embark_town] have missing values. The [deck] column is missing 77.22% of its values (more than 50%), so we drop it. We impute the other columns with scikit-learn's SimpleImputer: the mean for age and the most frequent value for embarked and embark_town.

In [50]:
df.drop(columns=['deck'],inplace=True)

In [51]:
from sklearn.impute import SimpleImputer

In [52]:
imp=SimpleImputer(strategy='mean')
df['age']=imp.fit_transform(df[['age']])


In [53]:
imp=SimpleImputer(strategy='most_frequent')
df[['embarked']]=imp.fit_transform(df[['embarked']])

In [54]:
imp=SimpleImputer(strategy='most_frequent')
df[['embark_town']]=imp.fit_transform(df[['embark_town']])

##### Now check again

In [55]:
round(df.isnull().sum()*100/len(df),2)

survived       0.0
pclass         0.0
sex            0.0
age            0.0
sibsp          0.0
parch          0.0
fare           0.0
embarked       0.0
class          0.0
who            0.0
adult_male     0.0
embark_town    0.0
alive          0.0
alone          0.0
dtype: float64

#### (Interpretation) There are no missing values.

#### Now we deal with outliers

In [64]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,684.0,684.0,684.0,684.0,684.0,684.0
mean,0.339181,2.504386,29.005059,0.27193,0.267544,16.317275
std,0.473778,0.714816,9.440358,0.500982,0.75569,12.611508
min,0.0,1.0,3.0,0.0,0.0,0.0
25%,0.0,2.0,23.0,0.0,0.0,7.8542
50%,0.0,3.0,29.699118,0.0,0.0,10.5
75%,1.0,3.0,33.0,0.0,0.0,23.0625
max,1.0,3.0,54.0,2.0,6.0,57.0


#### There are significant outliers in [age, fare]. The standard deviations of the 'age' and 'fare' columns are 9.44 and 12.61 respectively, so our target is to reduce them by removing outliers.

In [65]:
# remove outliers
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]    


In [66]:
# call the function for multiple columns
df = remove_outliers(df, 'age')
df = remove_outliers(df, 'fare')


In [67]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,605.0,605.0,605.0,605.0,605.0,605.0
mean,0.300826,2.596694,28.61846,0.242975,0.238017,13.56973
std,0.458997,0.642833,7.795477,0.483644,0.722997,8.467381
min,0.0,1.0,8.0,0.0,0.0,0.0
25%,0.0,2.0,23.0,0.0,0.0,7.7958
50%,0.0,3.0,29.699118,0.0,0.0,9.225
75%,1.0,3.0,32.0,0.0,0.0,16.1
max,1.0,3.0,48.0,2.0,5.0,40.125


#### Here we reduced the standard deviation of both columns

### Interpretation: No significant outliers remain.
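
To see how the IQR rule decides what counts as an outlier, here is a small worked sketch on made-up fares. The `demo` frame and its values are hypothetical. The function follows the same logic as `remove_outliers` above, but also returns the bounds so they can be inspected.

```python
import pandas as pd

def iqr_bounds_and_filter(df, column):
    """Return (lower_bound, upper_bound, filtered_df) using the 1.5 * IQR rule."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    kept = df[(df[column] >= lower) & (df[column] <= upper)]
    return lower, upper, kept

# Hypothetical fares with one obviously extreme value (60)
demo = pd.DataFrame({"fare": [7, 8, 8, 9, 10, 11, 12, 60]})
lower, upper, kept = iqr_bounds_and_filter(demo, "fare")

# Q1 = 8.0 and Q3 = 11.25, so IQR = 3.25 and the bounds are 3.125 and 16.125
print(lower, upper)
print(kept["fare"].tolist())
```

Only the value 60 falls outside the bounds, so it is the only row removed.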

### Now we work on feature encoding.

Why feature encoding is important for logistic regression:

1. **Categorical Data**: Logistic regression requires numerical input. Categorical features must be converted to numerical form.
2. **Model Performance**: Proper encoding helps the model understand relationships in the data, improving predictions.
3. **Interpretability**: Encoded features allow for easier interpretation of the model coefficients.
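
Label encoding turns categories into integers, which implicitly suggests an order (for example C < Q < S). For unordered categories, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a tiny made-up frame. It is an illustration only and is not the encoding this notebook applies.

```python
import pandas as pd

# Hypothetical sample with two categorical columns
demo = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S"],
    "sex": ["male", "female", "female", "male"],
})

# One-hot encode the unordered column; drop_first avoids a redundant dummy column
encoded = pd.get_dummies(demo, columns=["embarked"], drop_first=True, dtype=int)

# A two-level column can simply be mapped to 0/1
encoded["sex"] = (encoded["sex"] == "male").astype(int)

print(encoded)
```

With `drop_first=True` the first category (`C`) becomes the baseline, so the coefficients of `embarked_Q` and `embarked_S` are read relative to it.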

In [70]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True
5,0,3,male,29.699118,0,0,8.4583,Q,Third,man,True,Queenstown,no,True
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,Southampton,yes,False


In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 605 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     605 non-null    int64   
 1   pclass       605 non-null    int64   
 2   sex          605 non-null    object  
 3   age          605 non-null    float64 
 4   sibsp        605 non-null    int64   
 5   parch        605 non-null    int64   
 6   fare         605 non-null    float64 
 7   embarked     605 non-null    object  
 8   class        605 non-null    category
 9   who          605 non-null    object  
 10  adult_male   605 non-null    bool    
 11  embark_town  605 non-null    object  
 12  alive        605 non-null    object  
 13  alone        605 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(5)
memory usage: 58.6+ KB


In [73]:
# Label-encode every column whose dtype is `object` or `category`
for col in df.columns:
    if df[col].dtype in ['object', 'category']:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])



In [74]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True
5,0,3,1,29.699118,0,0,8.4583,1,2,1,True,1,0,True
8,1,3,0,27.0,0,2,11.1333,2,2,2,False,2,1,False


### Interpretation: Categorical columns are encoded, and our data preparation is done.

#### Now we move on to machine learning

In [75]:
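# Select features and target variable
# Caution: `alive` carries the same information as `survived`,
# so keeping it in X leaks the target into the features
X = df.drop(columns=['survived'])
y = df['survived']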

In [76]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [78]:
# instantiate the model
model = LogisticRegression()


In [79]:
# Fit the model on training data
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
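
The warning above suggests two remedies: scale the features or raise `max_iter`. A minimal sketch of both, on synthetic data, could look like this. The data, the names `X_demo`, `y_demo` and `scaled_model`, and the value `max_iter=1000` are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two unscaled numeric features
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(30, 10, 200), rng.normal(15, 8, 200)])
y_demo = (X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 5, 200) > 60).astype(int)

# Standardize the features, then fit with a higher iteration limit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually used
n_iter = int(scaled_model[-1].n_iter_[0])
print(n_iter)
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then reused unchanged at prediction time.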


In [80]:
# predict the target variable on test data
y_pred = model.predict(X_test)

In [81]:
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
recall = recall_score(y_test, y_pred)

In [82]:
# print the evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Recall: {recall}")
print(f"Classification Report:\n{class_report}")

Accuracy: 1.0
Confusion Matrix:
[[86  0]
 [ 0 35]]
Recall: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        86
           1       1.00      1.00      1.00        35

    accuracy                           1.00       121
   macro avg       1.00      1.00      1.00       121
weighted avg       1.00      1.00      1.00       121



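Model Performance
The classification report shows perfect scores on the test set. Treat this with caution: `X` still contains the `alive` column, which is just `survived` written as yes/no, so the model is effectively given the answer (data leakage). These numbers therefore do not measure real predictive performance.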

Precision: Both classes (0 and 1) have a precision of 1.00, meaning every prediction made for each class is correct.

Recall: Both classes have a recall of 1.00, indicating that the model finds every true instance of each class.

F1-Score: Both classes achieve an F1-score of 1.00, reflecting a perfect balance between precision and recall.

Support: The number of true instances for class 0 is 86, and for class 1 is 35.
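
The metrics above can be recomputed by hand from the confusion matrix printed earlier. The sketch below does this for class 1, assuming scikit-learn's layout in which rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
conf_matrix = [[86, 0],
               [0, 35]]

tn, fp = conf_matrix[0]  # true class 0: predicted 0, predicted 1
fn, tp = conf_matrix[1]  # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
support_1 = tp + fn         # number of true instances of class 1

print(precision, recall, f1, accuracy, support_1)
```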

Overall Accuracy
The overall accuracy of the model is 1.00, meaning it correctly classified all 121 instances in the test set.
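
A perfect score is what one would expect when a feature duplicates the target. In the seaborn Titanic data, `alive` is `survived` written as yes/no. A minimal sketch of how such a column could be detected and dropped is shown below, on a small hand-made frame so that it runs without downloading the dataset. The rows are made up, and this is not a cell from the original notebook.

```python
import pandas as pd

# Hypothetical rows in the same shape as the Titanic columns of interest
demo = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "alive": ["no", "yes", "yes", "no", "yes"],
    "age": [22.0, 38.0, 26.0, 35.0, 27.0],
})

# Check whether `alive` is just the target in text form
is_copy = bool((demo["alive"].map({"no": 0, "yes": 1}) == demo["survived"]).all())
print(is_copy)

# Drop the target and the leaking column before building the feature matrix
X_clean = demo.drop(columns=["survived", "alive"])
y_clean = demo["survived"]
print(list(X_clean.columns))
```

After removing `alive`, the evaluation would need to be rerun. The scores will most likely drop below 1.00, and only then do they say something about how well the model predicts survival.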

In [84]:
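# save the model
import os
import pickle
os.makedirs('./all_model', exist_ok=True)  # create the target folder if it does not exist yet
with open('./all_model/logistic_model.pkl', 'wb') as file:
    pickle.dump(model, file)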
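
A saved model is only useful if it can be loaded again. The sketch below shows a round trip with `pickle` on a tiny stand-in model. `demo_model`, the toy data, and the temporary directory are assumptions for illustration so the example does not touch `./all_model`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model so the example is self-contained
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logistic_model.pkl")
    # Save the fitted model
    with open(path, "wb") as file:
        pickle.dump(demo_model, file)
    # Load it back
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should predict exactly like the original
same_predictions = bool((restored.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and it is safest to load a model with the same scikit-learn version that was used to save it.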