# Capstone

***Student Information***
- **1. Student_ID** – Unique identifier assigned to each student.  
- **2. Name** – Full name of the student.  
- **3. Gender** – Gender of the student (Male/Female/Other).  
- **4. Age** – Age of the student in years.  
- **5. Education_Level** – Highest education level completed by the student.  
- **6. Employment_Status** – Employment status of the student.  
- **7. City** – City where the student resides.  
- **8. Device_Type** – Type of device used by the student.  
- **9. Internet_Connection_Quality** – Quality of the student’s internet connection.  

***Course Information***
- **10. Course_ID** – Unique identifier of the course.  
- **11. Course_Name** – Name of the course enrolled.  
- **12. Category** – Category/subject area of the course.  
- **13. Course_Level** – Difficulty level of the course.  
- **14. Course_Duration_Days** – Duration of the course in days.  
- **15. Instructor_Rating** – Rating of the course instructor.  

***Student Engagement Metrics***
- **16. Login_Frequency** – Number of login sessions by the student.  
- **17. Average_Session_Duration_Min** – Average time (in minutes) spent per session.  
- **18. Video_Completion_Rate** – Percentage of course videos completed.  
- **19. Discussion_Participation** – Participation level in course discussion forums.  
- **20. Time_Spent_Hours** – Total hours spent on the course/content.  
- **21. Days_Since_Last_Login** – Days since the last login.  
- **22. Notifications_Checked** – Number of notifications viewed or clicked.  
- **23. Peer_Interaction_Score** – Score indicating peer interactions.  
- **24. Assignments_Submitted** – Number of assignments submitted.  
- **25. Assignments_Missed** – Number of assignments missed.  
- **26. Quiz_Attempts** – Number of quiz attempts made.  
- **27. Quiz_Score_Avg** – Average quiz score.  
- **28. Project_Grade** – Grade of the final project.  
- **29. Progress_Percentage** – Percentage of course completion.  
- **30. Rewatch_Count** – Number of times lessons/videos were replayed.  

***Enrollment & Payment Details***
- **31. Enrollment_Date** – Date when the student enrolled.  
- **32. Payment_Mode** – Mode of payment used.  
- **33. Fee_Paid** – Whether the student paid the course fee.  
- **34. Discount_Used** – Whether a discount was applied.  
- **35. Payment_Amount** – Final amount paid after discount.  

***App & Support Interactions***
- **36. App_Usage_Percentage** – Percentage of usage via mobile app.  
- **37. Reminder_Emails_Clicked** – Number of reminder emails opened or clicked.  
- **38. Support_Tickets_Raised** – Number of support tickets raised.  
- **39. Satisfaction_Rating** – Overall student satisfaction score.  

***Target Variable***
- **40. Completed (Target)** – Indicates whether the student completed the course (Yes/No or 1/0).  


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

from warnings import filterwarnings
filterwarnings('ignore')
plt.rcParams['figure.figsize'] = [20,15]

In [None]:
data = pd.read_csv("Master_Dataset.csv")
df = data.copy()

In [None]:
df.drop('Unnamed: 0',axis=1,inplace=True)

In [None]:
pd.set_option('display.max_columns',50)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# check for duplicates
df[df.duplicated()]

* No duplicate rows are present.

In [None]:
# Checking for the presence of anomalies
for i in df.columns:
    print(i)
    print(df[i].unique())
    print()
    print(f'Number of unique values in {i} are:{len(df[i].unique())}')
    print()
    print('----------------------------------------------------------------------------------')

In [None]:
df['Discount_Used'].value_counts()

In [None]:
# Frequency between Fee_paid and Discount_Used
pd.crosstab(df['Fee_Paid'],df['Discount_Used'])

In [None]:
# Separating the numeric and categorical column names into lists
df_num_list = df.select_dtypes(include=np.number).columns.to_list()
df_cat_list = df.select_dtypes(include=object).columns.to_list()

In [None]:
# Number of numerical columns
len(df_num_list)

In [None]:
# Number of Categorical columns
len(df_cat_list)

In [None]:
# Separating the numeric and categorical columns into dataframes
df_num = df.select_dtypes(include=np.number)
df_cat = df.select_dtypes(include=object)

## NULL VALUE TREATMENT

In [None]:
# Null cells as a percentage of the number of rows
# (the denominator is len(df), not the total cell count df.size)
df.isnull().sum().sum()/len(df)*100

***Inference:***
* The total number of null cells is about 10% of the row count.

In [None]:
# Null value percentage for each column
df.isnull().sum()/len(df)*100
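
The denominator matters when reporting a null percentage. The following is a minimal sketch on a hypothetical toy frame (`toy` is not part of the dataset) that contrasts three common definitions:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 4 rows x 3 columns = 12 cells, 3 of them null
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, np.nan],
})

null_cells = toy.isnull().sum().sum()

# Null cells relative to the number of rows (len(...) in the denominator)
pct_of_rows = null_cells / len(toy) * 100

# Null cells relative to all cells in the frame
pct_of_cells = null_cells / toy.size * 100

# Share of rows that contain at least one null
pct_rows_with_null = toy.isnull().any(axis=1).mean() * 100

print(pct_of_rows, pct_of_cells, pct_rows_with_null)
```

Which of the three to report depends on the question being asked; the per-column percentages in the cell above are usually the most useful for deciding on a treatment.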

***AGE***

In [None]:
df['Age'].skew()

In [None]:
df['Age'].agg(['mean','median'])

In [None]:
df['Age'].value_counts()

***Inference***
* Age is slightly right-skewed, and most learners fall in the 17 to 30 age group.

In [None]:
plt.rcParams['figure.figsize'] = [20,10]

In [None]:
sns.boxplot(x= df['Age'])
plt.show()

In [None]:
# Replacing the null values of age column with the value median(25)
df['Age'] = df['Age'].fillna(df['Age'].median())

In [None]:
print(df['Age'].isnull().sum())
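
The median is a common choice for skewed columns because a few large values pull the mean upwards. A small sketch with made-up ages (the `ages` series is hypothetical) illustrates the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed ages with two missing values
ages = pd.Series([18, 19, 20, 21, 22, 60, np.nan, np.nan])

# The single large value (60) raises the mean well above the typical age
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print("mean:", ages.mean(), "median:", ages.median())
```

Here the median stays close to the bulk of the values, so the imputed rows look like typical learners rather than being shifted by the outlier.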

***Login_Frequency***

In [None]:
df['Login_Frequency'].skew()

In [None]:
df['Login_Frequency'].agg(['mean','median'])

In [None]:
df['Login_Frequency'].value_counts()

In [None]:
# Replacing the null values of Login_Frequency column with the value median(5)
df['Login_Frequency'] = df['Login_Frequency'].fillna(df['Login_Frequency'].median())

In [None]:
print(df['Login_Frequency'].isnull().sum())

***Discussion_Participation***

In [None]:
df['Discussion_Participation'].skew()

In [None]:
df['Discussion_Participation'].agg(['mean','median'])

In [None]:
df['Discussion_Participation'].value_counts()

In [None]:
# Replacing the null values of Discussion_Participation column with the value median(2)
df['Discussion_Participation'] = df['Discussion_Participation'].fillna(df['Discussion_Participation'].median())

In [None]:
print(df['Discussion_Participation'].isnull().sum())

***Assignments_Submitted***

In [None]:
df['Assignments_Submitted'].skew()

In [None]:
df['Assignments_Submitted'].agg(['mean','median'])

In [None]:
df['Assignments_Submitted'].mode()[0]

In [None]:
df['Assignments_Submitted'].value_counts()

In [None]:
df['Assignments_Submitted'] = df['Assignments_Submitted'].bfill().ffill()

In [None]:
df['Assignments_Submitted'].value_counts()

In [None]:
print(df['Assignments_Submitted'].isnull().sum())
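
To see what the `bfill().ffill()` chain does, here is a minimal sketch on a hypothetical series (`s` is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3.0, np.nan, np.nan, 7.0, np.nan])

# bfill copies the next valid value backwards; the trailing NaN has no
# later value, so ffill then copies the last valid value forwards
filled = s.bfill().ffill()
print(filled.tolist())
```

The result depends on row order: each gap is filled from a neighbouring row, so on data without a meaningful ordering this amounts to borrowing the value of whichever record happens to be adjacent.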

***Progress_Percentage***

In [None]:
df['Progress_Percentage'].skew()

In [None]:
df['Progress_Percentage'].agg(['mean','median'])

In [None]:
df['Progress_Percentage'].value_counts()

In [None]:
# We are filling the null values using bfill and ffill technique
df['Progress_Percentage'] = df['Progress_Percentage'].bfill().ffill()

In [None]:
print(df['Progress_Percentage'].isnull().sum())

***Rewatch_Count***

In [None]:
df['Rewatch_Count'].skew()

In [None]:
df['Rewatch_Count'].agg(['mean','median'])

In [None]:
df['Rewatch_Count'].value_counts()

In [None]:
# We are filling the null values using bfill and ffill technique
df['Rewatch_Count'] = df['Rewatch_Count'].bfill().ffill()

In [None]:
print(df['Rewatch_Count'].isnull().sum())

***App_Usage_Percentage***

In [None]:
df['App_Usage_Percentage'].skew()

In [None]:
df['App_Usage_Percentage'].agg(['mean','median'])

In [None]:
df['App_Usage_Percentage'].value_counts()

In [None]:
# Replacing the null values of the App_Usage_Percentage column with its median
df['App_Usage_Percentage'] = df['App_Usage_Percentage'].fillna(df['App_Usage_Percentage'].median())

In [None]:
print(df['App_Usage_Percentage'].isnull().sum())

***Satisfaction_Rating***

In [None]:
df['Satisfaction_Rating'].skew()

In [None]:
df['Satisfaction_Rating'].agg(['mean','median'])

In [None]:
df['Satisfaction_Rating'].value_counts()

In [None]:
# We are filling the null values using bfill and ffill technique
df['Satisfaction_Rating'] = df['Satisfaction_Rating'].bfill().ffill()

In [None]:
print(df['Satisfaction_Rating'].isnull().sum())

In [None]:
df.isnull().sum().sum()

## Outlier Detection

In [None]:
plt.rcParams['figure.figsize'] = [20,40]

In [None]:
# For Numeric Columns
t = 1
for i in df_num_list:
    plt.subplot(8,3,t)
    plt.title(f'Box plot of {i}')
    sns.boxplot(x=df[i])
    t +=1

plt.tight_layout()
plt.show()

***Inference***
* Time_Spent_Hours
* Days_Since_Last_Login
* Quiz_Score_Avg
* Project_Grade
* Progress_Percentage

***The columns listed above have the largest number of outliers.***

In [None]:
t = 1
for i in df_num_list:
    plt.subplot(8,3,t)
    plt.title(f'Histogram of {i}')
    sns.histplot(x=df[i],kde=True)
    t +=1

plt.tight_layout()
plt.show()

***Inference***
1. Age is right-skewed with most students aged 18–28.  
2. Course durations occur in fixed predefined values.  
3. Instructor ratings mostly fall between 4.0–4.9.  
4. Login frequency is low to moderate for most students.  
5. Session duration centers around ~40 minutes.  
6. Video completion rate is moderate for most students.  
7. Discussion participation is very low overall.  
8. Time spent is right-skewed, mostly 8–16 hours.  
9. Days since last login is heavily right-skewed.  
10. Notifications checked is low for most students.  
11. Peer interaction scores follow a mild bell-shaped pattern.  
12. Assignments submitted shows clustered counts.  
13. Assignments missed is low for most students.  
14. Quiz attempts vary but mostly remain under 10.  
15. Quiz score average forms a strong bell-curve.  
16. Project grades show a near-normal distribution.  
17. Progress percentage is normally distributed.  
18. Rewatch count is low for most students.  
19. Payment amounts cluster toward lower ranges.  
20. App usage percentage is moderately distributed.  
21. Reminder emails clicked is very low for most users.  
22. Support tickets raised is extremely low.  
23. Satisfaction rating is slightly right-skewed but mostly positive.  


In [None]:
for i in df_num:
    print(f'Skewness of the column {i} is : {df[i].skew()}')

***1.Using IQR Method***

In [None]:
Q1 = df_num.quantile(0.25)
Q3 = df_num.quantile(0.75)

In [None]:
IQR = Q3 - Q1

In [None]:
lower_whis  = Q1 - (1.5*IQR)

In [None]:
upper_whis  = Q3 + (1.5*IQR)

In [None]:
outliers = df[((df_num <lower_whis) | (df_num > upper_whis)).any(axis=1)]

In [None]:
len(outliers)

* 18256 records contain an outlier in at least one numeric column.

In [None]:
non_outliers = df[~((df_num <lower_whis) | (df_num > upper_whis)).any(axis=1)]

In [None]:
len(non_outliers)

***Inference:***
* 81744 records contain no outliers in any numeric column.
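
The row-level count above does not show which columns contribute the flags. A minimal sketch on a hypothetical toy frame (`toy`, `x`, and `y` are made up) shows how the same IQR mask can be summed per column:

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [10, 11, 12, 13, 14, 100],  # 100 lies far above the rest
    "y": [1, 2, 3, 4, 5, 6],         # no extreme values
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask: True where a cell lies outside its column's whiskers
mask = (toy < lower) | (toy > upper)

outliers_per_column = mask.sum()          # count of flagged cells per column
outlier_rows = toy[mask.any(axis=1)]      # rows with at least one flagged cell

print(outliers_per_column)
print(outlier_rows)
```

Applied to the numeric frame in this notebook, the per-column counts would indicate whether a handful of heavily skewed columns account for most of the flagged records.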

## Univariate Analysis

In [None]:
plt.rcParams['figure.figsize'] = [20,40]

### Numeric

***HistPlot***

In [None]:
t = 1
for i in df_num_list:
    plt.subplot(8,3,t)
    plt.title(f'HistPlot of {i}')
    sns.histplot(df[i])
    t +=1
plt.tight_layout()
plt.show()


***DistPlot***

In [None]:
t = 1
for i in df_num_list:
    plt.subplot(8,3,t)
    sns.histplot(df_num[i],kde=True,stat='density',color='green') # distplot is deprecated; histplot with kde=True draws the histogram plus density curve
    plt.title(f'DistPlot of {i}')
    t +=1
plt.tight_layout()
plt.show()

***Inference:***
* Many of the distributions are skewed, which suggests the presence of outliers.

***KDEPlot***

In [None]:
t = 1
for i in df_num_list:
    plt.subplot(8,3,t)
    plt.title(f'KDE Plot of {i}')
    sns.kdeplot(df[i])
    t +=1
plt.tight_layout()
plt.show()

***Inference:***
* Many of the distributions are skewed, which suggests the presence of outliers.

In [None]:
t = 1
for i in df_num_list:
    plt.subplot(8,3,t)
    plt.title(f'Box Plot of {i}')
    sns.boxplot(x=df[i])
    t +=1
plt.tight_layout()
plt.show()

***Inference:***
* Outliers are present in almost all the plots except for the columns Course_Duration_Days, Instructor_Rating, and Video_Completion_Rate.

In [None]:
# Five-point summary for numeric columns
df_num.describe().T

In [None]:
for i in df_num:
    print(f'Skewness of the column {i} is : {df[i].skew()}')

In [None]:
for i in df_num:
    print(f'Kurtosis of the column {i} is : {df[i].kurt()}')

### Categorical

In [None]:
plt.rcParams['figure.figsize'] = [20,40]

***Unique values and number of unique values for categoric columns***

In [None]:
for i in df_cat_list:
    print(i)
    print()
    print(df_cat[i].unique())
    print()
    print(f'Number of unique values in the column {i} are : {len(df_cat[i].unique())}')
    print()
    print('-----------------------------------------------------------------------------------------------')

***Value counts for categorical***

In [None]:
for i in df_cat_list:
    
    print(i)
    print()
    print(df_cat[i].value_counts())
    print()
    print('-----------------------------------------------------------------------------------------------')

In [None]:
cat_graph_list = ['Gender','Education_Level','Employment_Status','City','Device_Type','Internet_Connection_Quality','Course_ID'
             ,'Course_Name','Category','Course_Level','Payment_Mode','Fee_Paid','Discount_Used','Completed']

***Pie Chart***

In [None]:
t = 1
for i in cat_graph_list:
    plt.subplot(6,3,t)
    plt.title(f'Pie chart of {i}')
    df[i].value_counts().plot(kind='pie',autopct='%.2f%%')
    t +=1
plt.tight_layout()
plt.show()

***Inference***
1. Females form the majority of the student population.  
2. Most students hold a Bachelor's degree.  
3. Employment status is dominated by Students, followed by Employed individuals.  
4. City distribution is diverse, with no single city heavily dominating.  
5. Mobile devices are the most commonly used for accessing the platform.  
6. Internet quality is mostly Medium for users.  
7. Course enrollments are fairly balanced across available Course_IDs.  
8. Popular courses include Python, Machine Learning, and AI-related subjects.  
9. Programming is the most common course category.  
10. Most learners enroll in Beginner-level courses.  
11. Payment modes are widely distributed, with UPI used most frequently.  
12. A majority of users have paid their course fees.  
13. Most students do not use discounts.  
14. Course completion is nearly balanced between completed and not completed.  

***Count Plot***

In [None]:
t = 1
for i in cat_graph_list:
    plt.subplot(6,3,t)
    plt.title(f'CountPlot of {i}')
    sns.countplot(x=df[i])
    t +=1
plt.tight_layout()
plt.show()

***Inference***
1. There are more female learners than male learners.
2. Most learners have a Bachelor's degree as their highest education level.
3. Most learners are either students or employed.
4. The most preferred course is Python Basics and the least preferred is UI/UX Design Fundamentals.
5. The majority of learners have chosen the Programming category.
6. Half of the learners have chosen the Beginner course level.
7. More than 75% of learners have paid their fees and have not used any discount.
8. More than half of the learners have not completed their course.

## Bivariate Analysis

### Numeric vs Numeric

***Scatter plot***

In [None]:
df_num_list

In [None]:
scatter_num = ['Age','Project_Grade','Instructor_Rating','Time_Spent_Hours','Peer_Interaction_Score','Assignments_Submitted','Assignments_Missed',
              'Quiz_Attempts','Quiz_Score_Avg','Progress_Percentage','Satisfaction_Rating']

In [None]:
plt.rcParams['figure.figsize'] = [20,70]

In [None]:
t = 1
for i in scatter_num:
    for j in scatter_num:
        if i!=j:
            plt.subplot(22,5,t)
            plt.title(f'Scatter plot between {i} and {j}')
            sns.scatterplot(x=df[i],y=df[j])
            t +=1
plt.tight_layout()
plt.show()

***Inference***

1. **Video Completion Rate vs Progress Percentage**  
   There is a clear positive trend — students who watch more videos also have higher overall course progress.

2. **Assignments Submitted vs Progress Percentage**  
   Students who submit more assignments tend to show better progress in the course.

3. **Time Spent vs Progress Percentage**  
   Learners who spend more hours on the platform generally progress more, though the relation is not perfectly linear.

4. **Quiz Score Avg vs Video Completion Rate**  
   Higher video completion is linked with slightly better quiz scores, showing that consistent video learning helps performance.

5. **Project Grade vs Assignments Submitted**  
   Students who submit more assignments usually achieve better project grades.

6. **Satisfaction Rating vs Instructor Rating**  
   Higher instructor ratings are associated with higher student satisfaction, indicating instructor quality impacts user experience.

7. **Age vs Time Spent**  
   Age does not show a strong relationship with time spent; learners across ages behave similarly.

8. **Fee Paid vs Completion**  
   Students who paid the fee tend to complete the course more often than free learners.



In [None]:
plt.rcParams['figure.figsize'] = [20,20]

In [None]:
sns.heatmap(df_num.corr(),annot=True)
plt.show()

***Inference:***

The heatmap shows the following highly correlated column pairs:

* Instructor_Rating vs Course_Duration_Days (0.64)
* Video_Completion_Rate vs Progress_Percentage (0.59)
* Progress_Percentage vs Assignments_Submitted (0.83)
* Assignments_Missed vs Progress_Percentage (-0.82)
* Assignments_Missed vs Assignments_Submitted

***The columns listed above are highly correlated.***
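
Reading pairs off a large annotated heatmap is error-prone, so the same information can be pulled from the correlation matrix directly. This is a sketch on synthetic data; the frame `toy`, its columns, and the 0.5 cutoff are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),  # strongly tied to a
    "c": -base + rng.normal(scale=0.1, size=200),     # strongly negative with a
    "d": rng.normal(size=200),                        # unrelated noise
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per column pair, masked cells removed
pairs = upper.stack().dropna()

# Pairs whose absolute correlation reaches the (assumed) 0.5 cutoff
strong = pairs[pairs.abs() >= 0.5].sort_values(key=abs, ascending=False)
print(strong)
```

Running the same steps on `df_num.corr()` would list every pair above the chosen cutoff together with its coefficient, including pairs that are easy to miss visually.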

In [None]:
biv_graph = ['Video_Completion_Rate' , 'Progress_Percentage','Assignments_Submitted', 
'Course_Duration_Days' , 'Instructor_Rating','Assignments_Missed']

In [None]:
plt.rcParams['figure.figsize'] = [20,40]

In [None]:
t = 1
for i in biv_graph:
    for j in biv_graph:
        if i!=j:
            plt.subplot(9,4,t)
            plt.title(f'Scatter plot between {i} and {j}')
            sns.scatterplot(x=df[i],y=df[j])
            t +=1
plt.tight_layout()
plt.show()

***Inference***

1. **Video_Completion_Rate ↔ Progress_Percentage**  
   Strong positive relationship. Higher video completion leads to higher course progress.

2. **Assignments_Submitted ↔ Progress_Percentage**  
   More submitted assignments result in higher overall progress.

3. **Assignments_Missed ↔ Progress_Percentage**  
   Students who miss more assignments show lower progress.

4. **Assignments_Submitted ↔ Assignments_Missed**  
   Clear inverse pattern—students who submit many assignments miss fewer.

5. **Video_Completion_Rate ↔ Assignments_Submitted**  
   Students completing more videos tend to submit more assignments (higher engagement).

6. **Course_Duration_Days ↔ Progress_Percentage**  
   Longer courses generally show lower or more scattered progress (possible difficulty/fatigue effect).


***Inference***

***Strong Positive Correlation***
* Video completion rate and progress percentage
* Assignments submitted and progress percentage
* Course duration days and instructor rating


***Strong Negative Correlation***
* Progress percentage and assignments missed

### Numeric vs Categoric

In [None]:
cat_num_graph = ['Satisfaction_Rating','App_Usage_Percentage','Project_Grade','Quiz_Score_Avg','Assignments_Submitted',
                'Time_Spent_Hours','Video_Completion_Rate','Course_Duration_Days','Age']

In [None]:
cat_cat_graph = ['Gender','Education_Level','Course_Name','Course_Level','Category']

In [None]:
plt.rcParams['figure.figsize'] = [20,50]

In [None]:
t = 1
for i in cat_num_graph:
    for j in cat_cat_graph:
        plt.subplot(15,3,t)
        plt.title(f'Line plot between {i} and {j}')
        sns.lineplot(x=df[j],y=df[i])
        t +=1
plt.tight_layout()
plt.show()

***Inference***

1. Satisfaction rating is slightly higher for intermediate learners and for students with a Master’s education. Some courses also show better satisfaction than others.

2. App usage is higher among students with a Master’s degree and slightly higher for females. Programming courses show the highest app usage.

3. Project grades are generally better for male students and improve as education level increases. Intermediate level students also score better.

4. Quiz scores are higher for students with higher education, especially Master-level. Programming courses also show better quiz scores than business-related ones.

5. Students with higher education, especially up to Master’s level, submit more assignments. Programming courses again show the highest submission levels.

6. Female learners spend more time on the platform. Time spent is highest at the intermediate course level and increases with education level up to Master’s.

7. Video completion rate improves with higher education levels. Programming courses and intermediate-level learners show the best completion rates.

8. Course duration is generally higher for students with higher education and for programming-related courses.

9. Age slightly varies across groups. Intermediate-level learners tend to be a bit older, and the “Other” gender category shows slightly higher age on average.



### Categoric vs Categoric

In [None]:
plt.rcParams['figure.figsize'] = [20,10]

***Gender vs Education_Level***

In [None]:
pd.crosstab(df['Gender'],df['Education_Level']).T

In [None]:
sns.countplot(hue=df['Gender'],x=df['Education_Level'])
plt.show()

***Inference:***
* Most learners hold a Bachelor's degree, with females slightly in the majority.

***Gender vs Employment_Status***

In [None]:
pd.crosstab(df['Gender'],df['Employment_Status']).T

In [None]:
sns.countplot(hue=df['Gender'],x=df['Employment_Status'])
plt.show()

***Inference:***
* Most learners pursuing these courses are students or employed.

***Gender vs Completed***

In [None]:
pd.crosstab(df['Gender'],df['Completed']).T

In [None]:
sns.countplot(hue=df['Gender'],x=df['Completed'])
plt.show()

***Inference:***
* The numbers of learners who completed and did not complete are almost equal; within each group, females are the majority.

***Education_Level VS Employment_Status***

In [None]:
pd.crosstab(df['Education_Level'],df['Employment_Status']).T

In [None]:
sns.countplot(hue=df['Education_Level'],x=df['Employment_Status'])
plt.show()

***Inference:***
* Bachelor’s degree holders dominate across all employment categories, with Students and Employed groups having the highest overall counts compared to Self-Employed and Unemployed individuals.

***Completed VS Course_Name***

In [None]:
pd.crosstab(df['Course_Name'],df['Completed']).T

In [None]:
sns.countplot(hue=df['Completed'],x=df['Course_Name'])
plt.show()

***Inference:***
* Many learners choose Python Basics and Data Analysis with Python.

***Completed VS Category***

In [None]:
pd.crosstab(df['Category'],df['Completed']).T

In [None]:
sns.countplot(hue=df['Completed'],x=df['Category'])
plt.show()

***Inference:***
* Most learners opt for the Programming category, followed by Marketing.

***Course_Level VS Completed***

In [None]:
pd.crosstab(df['Course_Level'],df['Completed']).T

In [None]:
sns.countplot(hue=df['Completed'],x=df['Course_Level'])
plt.show()

***Inference:***
* Most learners take Beginner and Intermediate courses, and fewer take Advanced courses.

## Target Variable Analysis

In [None]:
cat_num_graph = ['Instructor_Rating','Satisfaction_Rating','App_Usage_Percentage','Project_Grade','Quiz_Score_Avg','Assignments_Submitted',
                'Time_Spent_Hours','Video_Completion_Rate','Course_Duration_Days','Age']

In [None]:
plt.rcParams['figure.figsize'] = [15,70]

In [None]:
t = 1
for i in df_num_list:
    plt.subplot(12,2,t)
    plt.title(f'Bar plot of {i} and Completed')
    sns.barplot(x=df['Completed'],y=df[i])
    t +=1
plt.tight_layout()
plt.show()

***Inference:***
* For almost all numeric columns, the mean is nearly equal across the two target classes.

In [None]:
t = 1
for i in df_num_list:
    plt.subplot(12,2,t)
    plt.title(f'Line plot of {i} and Completed')
    sns.lineplot(x=df['Completed'],y=df[i])
    t +=1
plt.tight_layout()
plt.show()

***Inference***

1. **Completed learners spend more time on the course.**  
   They log more hours and watch more videos compared to non-completers.

2. **Video Completion Rate is strongly higher for completed students.**  
   This is one of the clearest indicators of completion.

3. **Assignments Submitted is much higher for students who completed.**  
   Completing more assignments seems necessary for finishing the course.

4. **Students who completed also participate more in discussions.**  
   More engagement → higher chance of completion.

5. **Instructor Interaction Score is higher among completed users.**  
   Students who ask more questions or interact more tend to succeed.

6. **Quiz Score Average and Project Grades are higher for completed users.**  
   Better performance goes hand-in-hand with finishing the course.

7. **Progress Percentage is obviously much higher for completed users.**  
   This is expected, but the difference is large and clear.

8. **App Usage Percentage is higher for students who completed.**  
   They use the learning app more consistently.

9. **Payment Amount is slightly higher for completed users.**  
   This may indicate commitment: paid students finish more often.

10. **Reminder Emails Clicked is higher for completers.**  
   They respond more to reminders and stay on track.

11. **Support Tickets Raised is slightly higher among completers.**  
   They seek help when needed, which may prevent drop-off.

12. **Age and Instructor Rating don't show significant differences.**  
   These factors do not strongly affect whether a learner completes.



In [None]:
cat_target_graph = ['Gender','Education_Level','Employment_Status','City','Course_Name','Device_Type','Internet_Connection_Quality',
                 'Course_Level','Category','Fee_Paid','Discount_Used']

In [None]:
plt.rcParams['figure.figsize'] = [15,30]

In [None]:
t = 1
for i in cat_target_graph:
    plt.subplot(6,2,t)
    plt.title(f'Count Plot of {i} and Completed')
    sns.countplot(hue=df['Completed'],x=df[i])
    t +=1
plt.tight_layout()
plt.show()

***Inference***

1. There are considerably more learners in the Male and Female gender groups.

## Stats Test

### Chi Square test for independence

In [None]:
from scipy import stats

In [None]:
for i in df_cat_list:
    if i != 'Completed':
        print('Hypothesis:')
        print(f'Ho: {i} and Completed are independent')
        print(f'Ha: {i} and Completed are dependent')
    
        f_obs = pd.crosstab(df[i], df['Completed'])

        chi_stats, p_val, dof, f_exp = stats.chi2_contingency(f_obs)
        print()
        print('Conclusion:')
        if p_val > 0.05:
            print('Fails to reject Ho')
            print(f'Ho: {i} and Completed are independent')
        else:
            print('Reject Ho')
            print(f'Ha: {i} and Completed are dependent')
        print()
        print('------------------')

***Inference:***
* Gender and Completed are dependent
* Education_Level and Completed are dependent
* Employment_Status and Completed are dependent
* Device_Type and Completed are dependent
* Internet_Connection_Quality and Completed are dependent
* Payment_Mode and Completed are dependent
* Fee_Paid and Completed are dependent
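
With a large sample, a chi-square test can reject independence even when the association is weak, so it can help to report the p-value and an effect size alongside the decision. A minimal sketch with hypothetical counts (not taken from the dataset); Cramér's V is an addition here, not something the notebook computes:

```python
import pandas as pd
from scipy import stats

# Hypothetical counts, not taken from the dataset
toy = pd.DataFrame({
    "Device_Type": ["Mobile"] * 100 + ["Laptop"] * 100,
    "Completed": (["Completed"] * 70 + ["Not Completed"] * 30
                  + ["Completed"] * 40 + ["Not Completed"] * 60),
})

observed = pd.crosstab(toy["Device_Type"], toy["Completed"])
chi2, p_val, dof, expected = stats.chi2_contingency(observed)

# Cramér's V rescales chi2 to the range [0, 1] as a rough effect size
n = observed.to_numpy().sum()
cramers_v = (chi2 / (n * (min(observed.shape) - 1))) ** 0.5

print(f"chi2={chi2:.2f}, p={p_val:.4g}, dof={dof}, V={cramers_v:.2f}")
```

The `expected` array is also worth checking, since the chi-square approximation is usually considered reliable only when expected counts are not too small (a common rule of thumb is at least 5 per cell).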

### T-Test

***Independent two tail t-test***

***Assumptions:***

* Each group has n > 30 and the population standard deviation is unknown, so a t-test is used.
* The test is two-tailed because we are testing for any significant difference between the group means, without assuming whether one group's mean is higher or lower.
* Out of the 23 numeric variables, we manually picked some of the most significant ones for the t-test.
* The selected numerical columns are 'Course_Duration_Days', 'Login_Frequency', 'Average_Session_Duration_Min', 'Video_Completion_Rate', 'Discussion_Participation', 'Time_Spent_Hours', 'Days_Since_Last_Login', 'Assignments_Submitted', 'Quiz_Score_Avg', and 'Progress_Percentage'.

In [None]:
t_test_num = ['Course_Duration_Days','Login_Frequency','Average_Session_Duration_Min','Video_Completion_Rate',
'Discussion_Participation','Time_Spent_Hours','Days_Since_Last_Login','Assignments_Submitted','Quiz_Score_Avg',
'Progress_Percentage']

In [None]:
len(df[df['Completed'] == 'Not Completed'])

In [None]:
for i in t_test_num:
    comp = df[df['Completed'] == 'Completed'][i]
    no_comp = df[df['Completed'] == 'Not Completed'][i]
    n1 = len(df[df['Completed'] == 'Completed'])
    n2 = len(df[df['Completed'] == 'Not Completed'])
    dof = (n1+n2)-2
    alpha = 0.05

    print('Hypothesis:')
    print()
    print(f"Ho : Mean of the {i} is SAME for 'Completed' and  'Not Completed'(μ(Completed) = μ(Not Completed))")
    print(f"Ha : Mean of the {i} is NOT SAME for 'Completed' and  'Not Completed'(μ(Completed) != μ(Not Completed))")
    print()

    t_stats,p_val = stats.ttest_ind(comp,no_comp,alternative='two-sided')
    t_critic = stats.t.isf((alpha/2),dof)

    if (p_val >= 0.05) and (abs(t_stats) <= t_critic):
        print('CONCLUSION:Fails to reject Ho')
        print()
        print(f"Ho : Mean of the {i} is SAME for 'Completed' and  'Not Completed'")
    else:
        print('CONCLUSION:Reject Ho')
        print(f"Ha : Mean of the {i} is NOT SAME for 'Completed' and  'Not Completed'")
            
    print()
    print('------------------')

***Inference***

* Only the mean of Course_Duration_Days is the same for 'Completed' and 'Not Completed'.

* The means of Login_Frequency, Average_Session_Duration_Min, Video_Completion_Rate, Discussion_Participation, Time_Spent_Hours, Days_Since_Last_Login, Assignments_Submitted, Quiz_Score_Avg, and Progress_Percentage differ between 'Completed' and 'Not Completed'.
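
For a two-sided test, comparing the p-value with alpha and comparing |t| with the critical value lead to the same decision, so one check is enough. The sketch below uses synthetic groups (all numbers are made up) and Welch's variant via `equal_var=False`, which is a swapped-in option for unequal group variances rather than the pooled test used above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic groups with different means and different spreads
completed = rng.normal(loc=60, scale=10, size=500)
not_completed = rng.normal(loc=50, scale=15, size=500)

# equal_var=False switches to Welch's t-test, which does not assume equal variances
t_stat, p_val = stats.ttest_ind(completed, not_completed,
                                equal_var=False, alternative="two-sided")

alpha = 0.05
reject_h0 = p_val < alpha

print(f"t={t_stat:.2f}, p={p_val:.3g}, reject H0: {reject_h0}")
```

Printing the statistic and p-value for each column, as here, also makes it easier to see how strong each difference is rather than only whether it crosses the threshold.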

## Encoding

In [None]:
for i in df_cat_list:
    print()
    print(i)
    print()
    print(df[i].unique())
    print()
    print('-------------------------------------------------------------------------')

* n-1 dummy --> Fee_Paid, Discount_Used, Completed
* ordinal --> Education_Level, Internet_Connection_Quality, Course_Level
* label --> Gender, Employment_Status, City, Device_Type, Course_Name, Payment_Mode, Category
* drop --> Student_ID, Name, Course_ID, Enrollment_Date

In [None]:
dummy = ['Fee_Paid','Discount_Used','Completed']
ordinal = ['Education_Level','Internet_Connection_Quality','Course_Level']
label = ['Gender','Employment_Status','City','Device_Type','Course_Name','Payment_Mode','Category']
drop = ['Student_ID','Name','Course_ID','Enrollment_Date']

In [None]:
df_num_list

#### Dropping the columns

In [None]:
# Dropping the columns Student_ID,Name,Course_ID,Enrollment_Date
for i in drop:
    df.drop(i,inplace=True,axis=1)

In [None]:
df.shape

#### N-1 Dummy

In [None]:
# Performing the dummy encoding for the columns 'Fee_Paid','Discount_Used','Completed'
dummy = ['Fee_Paid','Discount_Used','Completed']
for i in dummy:
    df[i] = pd.get_dummies(df[i],drop_first=True,dtype='int')

In [None]:
df[['Fee_Paid','Discount_Used','Completed']]
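
With `drop_first=True`, pandas drops the first category in sorted order, which decides what the value 1 means after encoding. A minimal sketch, assuming the target labels are the 'Completed' / 'Not Completed' strings used in the t-test cell (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"Completed": ["Completed", "Not Completed", "Completed"]})

# 'Completed' sorts before 'Not Completed', so it is the dropped category
dummies = pd.get_dummies(toy["Completed"], drop_first=True, dtype=int)
print(dummies.columns.tolist())

# The surviving column is 1 where the learner did NOT complete
encoded = dummies.iloc[:, 0]

# An explicit mapping makes the meaning of 1 unambiguous
explicit = toy["Completed"].map({"Not Completed": 0, "Completed": 1})

print(encoded.tolist(), explicit.tolist())
```

Under this assumption the encoded target equals 1 for learners who did not complete, which is worth keeping in mind when reading model coefficients, precision, and recall later on.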

#### Ordinal Encoding

In [None]:
# Performing ordinal encoding on columns 'Education_Level','Internet_Connection_Quality','Course_Level'
ordinal = ['Education_Level','Internet_Connection_Quality','Course_Level']


***Education_Level***

In [None]:
df['Education_Level'].unique()

In [None]:
# Education_Level
from sklearn.preprocessing import OrdinalEncoder
od = OrdinalEncoder(categories=[[ 'PhD','Master', 'Bachelor','Diploma','HighSchool']])
df['Education_Level'] = od.fit_transform(df[['Education_Level']])

In [None]:
df['Education_Level'].unique()

***Internet_Connection_Quality***

In [None]:
df['Internet_Connection_Quality'].unique()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
od = OrdinalEncoder(categories=[[ 'High','Medium', 'Low']])
df['Internet_Connection_Quality'] = od.fit_transform(df[['Internet_Connection_Quality']])

In [None]:
df['Internet_Connection_Quality'].unique()

***Course_Level***

In [None]:
df['Course_Level'].unique()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
od = OrdinalEncoder(categories=[[ 'Advanced','Intermediate', 'Beginner']])
df['Course_Level'] = od.fit_transform(df[['Course_Level']])

In [None]:
df['Course_Level'].unique()
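
How the `categories` argument turns labels into numbers can be seen on a toy column: the position of a label in the list becomes its code. The sketch below uses an ascending order (Beginner first) purely for illustration; the cells above list the levels in descending order, so there the highest level gets code 0. The toy frame and the `_code` column name are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column standing in for df['Course_Level'].
toy = pd.DataFrame({"Course_Level": ["Beginner", "Advanced", "Intermediate", "Beginner"]})

# The index of each label inside `categories` becomes its integer code.
order = [["Beginner", "Intermediate", "Advanced"]]
enc = OrdinalEncoder(categories=order)
toy["Course_Level_code"] = enc.fit_transform(toy[["Course_Level"]]).ravel()

# Recover the label -> code mapping from the fitted encoder.
mapping = dict(zip(enc.categories_[0], range(len(enc.categories_[0]))))
print(mapping)
print(toy)
```

Either direction is a valid ordinal encoding; what matters is that the order is monotonic and that the direction is kept in mind when reading coefficients or feature importances.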

#### Label Encoding

In [None]:
# Performing label encoding on columns 'Gender','Employment_Status','City','Device_Type','Course_Name','Payment_Mode','Category'
label = ['Gender','Employment_Status','City','Device_Type','Course_Name','Payment_Mode','Category']
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
for i in label:
    df[i] = lb.fit_transform(df[i])


In [None]:
df[['Gender','Employment_Status','City','Device_Type','Course_Name','Payment_Mode','Category']]

## Base Model

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.drop('Completed',axis=1)

In [None]:
y = df['Completed']

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=0.2,random_state=42)

### Scaling

***Robust Scaling***

***We chose robust scaling because the dataset contains many outliers***

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
rb = RobustScaler()

In [None]:
for i in df_num_list:
    xtrain[i] = rb.fit_transform(xtrain[[i]])
    xtest[i] = rb.transform(xtest[[i]])
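
The same train-only fit can be done for all numeric columns in one call. This is a minimal sketch on toy frames; `num_cols` plays the role of `df_num_list`, and the column values are made up.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy train/test frames; the last training row is an outlier on purpose.
train = pd.DataFrame({"Age": [20.0, 22.0, 25.0, 30.0, 90.0],
                      "Time_Spent_Hours": [1.0, 2.0, 3.0, 4.0, 50.0]})
test = pd.DataFrame({"Age": [25.0, 40.0], "Time_Spent_Hours": [3.0, 6.0]})
num_cols = ["Age", "Time_Spent_Hours"]

scaler = RobustScaler()  # centers on the median and scales by the interquartile range
train_scaled = train.copy()
test_scaled = test.copy()
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])  # statistics come from train only
test_scaled[num_cols] = scaler.transform(test[num_cols])        # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

Because the median and IQR barely move when a single extreme value is present, the outlier does not compress the rest of the column the way min-max scaling would.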

***Adding Constant***

In [None]:
import statsmodels.api as sma

In [None]:
xtrain_c = sma.add_constant(xtrain)
xtest_c = sma.add_constant(xtest)

### Base Model - 1 - Logit

In [None]:
import statsmodels.api as sma

In [None]:
model1 = sma.Logit(ytrain,xtrain_c).fit()
print(model1.summary())

#### User-defined function for evaluation metrics

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score,f1_score

In [None]:
summary = pd.DataFrame(columns=['Name', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

In [None]:
def metrics(name, ytest, ypred):
    global summary

    print(classification_report(ytest , ypred ))
    acc = round(accuracy_score(ytest, ypred),3)
    pre = round(precision_score(ytest, ypred),3)
    rec = round(recall_score(ytest, ypred),3)
    f1 = round(f1_score(ytest, ypred),3)
    
    result = pd.DataFrame({'Name':[name], 'Accuracy':[acc], 
                           'Precision':[pre], 'Recall':[rec], 'F1 Score':[f1]})
    summary = pd.concat([summary, result], ignore_index=True)
    print(summary)

In [None]:
ypred1_prob = model1.predict(xtest_c)

In [None]:
ypred1_prob

In [None]:
ypred1 = [0 if i<0.5 else 1 for i in ypred1_prob]
ypred1[:5]

In [None]:
metrics('Logit-Model-1(Test)',ytest,ypred1)

In [None]:
ypred1_prob_train = model1.predict(xtrain_c)
ypred1_prob_train[:5]

In [None]:
ypred1_train = [0 if i<0.5 else 1 for i in ypred1_prob_train]
ypred1_train[:5]

In [None]:
metrics('Logit-Model-1(Train)',ytrain,ypred1_train)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(ytest, ypred1)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred1_prob)


auc_score = roc_auc_score(ytest, ypred1_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()
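
The confusion-matrix and ROC cells are repeated for every model below, so they could be wrapped in one helper. This is a sketch only: `evaluate_binary` and its `plot` flag are hypothetical names, not part of the notebook, and the labels and scores are toy values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

def evaluate_binary(name, y_true, y_pred, y_prob, plot=True):
    """Return the confusion matrix (as a DataFrame) and the ROC AUC; optionally plot the ROC curve."""
    cm = pd.DataFrame(confusion_matrix(y_true, y_pred, labels=[0, 1]),
                      index=["Actual:0", "Actual:1"],
                      columns=["Predicted:0", "Predicted:1"])
    auc = roc_auc_score(y_true, y_prob)
    if plot:
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.4f})")
        plt.plot([0, 1], [0, 1], "r--")  # chance line
        plt.xlabel("False Positive Rate")
        plt.ylabel("True Positive Rate")
        plt.title("ROC Curve")
        plt.legend(loc="lower right")
        plt.show()
    return cm, auc

# Toy labels and scores, only to show the call.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (y_prob >= 0.5).astype(int)
cm, auc = evaluate_binary("toy", y_true, y_pred, y_prob, plot=False)
print(cm)
print(round(auc, 2))
```

In the notebook the call would look like `evaluate_binary('Logit', ytest, ypred1, ypred1_prob)`, with `plot` left at its default so the curve is drawn.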

### Checking the imbalance of the target variable

In [None]:
df['Completed'].value_counts()

In [None]:
plt.rcParams['figure.figsize'] = [20,10]
df['Completed'].value_counts().plot(kind='bar')
plt.show()

***Inference:***

* 'Yes' is slightly more frequent than 'No', but the difference is negligible.
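
Class shares are often easier to judge than raw counts, and a stratified split keeps those shares identical in the train and test parts. A minimal sketch on a toy target follows; the 60/40 ratio and the feature column are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with a mild imbalance, standing in for df['Completed'].
y_toy = pd.Series([1] * 60 + [0] * 40, name="Completed")
X_toy = pd.DataFrame({"feature": range(100)})

print(y_toy.value_counts(normalize=True))  # class shares instead of raw counts

# stratify=y_toy keeps the class shares the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```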

## Model-2-KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()
model2 = knn.fit(xtrain,ytrain)
model2

In [None]:
ypred2 = model2.predict(xtest)
ypred2[:5]

In [None]:
metrics('KNN-Model-2(Test)',ytest,ypred2)

In [None]:
ypred2_train = model2.predict(xtrain)
ypred2_train[:5]

In [None]:
metrics('KNN-Model-2(Train)',ytrain,ypred2_train)

In [None]:
ypred2_prob = model2.predict_proba(xtest)
ypred2_prob[:5]

In [None]:
ypred2_prob = ypred2_prob[:,1]
ypred2_prob

In [None]:
cm = confusion_matrix(ytest, ypred2)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred2_prob)


auc_score = roc_auc_score(ytest, ypred2_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Model-3-GaussianNB

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
gnb = GaussianNB()
model3 = gnb.fit(xtrain,ytrain)
model3

In [None]:
ypred3 = model3.predict(xtest)
ypred3[:5]

In [None]:
metrics('GaussianNB-Model-3(Test)',ytest,ypred3)

In [None]:
ypred3_train = model3.predict(xtrain)
ypred3_train[:5]

In [None]:
metrics('GaussianNB-Model-3(Train)',ytrain,ypred3_train)

In [None]:
ypred3_prob = model3.predict_proba(xtest)
ypred3_prob[:5]

In [None]:
ypred3_prob = ypred3_prob[:,1]
ypred3_prob

In [None]:
cm = confusion_matrix(ytest, ypred3)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred3_prob)


auc_score = roc_auc_score(ytest, ypred3_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()


## Model-4-Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier(random_state=42)
model4 = dt.fit(xtrain,ytrain)
model4

In [None]:
ypred4 = model4.predict(xtest)
ypred4[:5]

In [None]:
metrics('Decision Tree-Model-4(Test)',ytest,ypred4)

In [None]:
ypred4_train = model4.predict(xtrain)
ypred4_train[:5]

In [None]:
metrics('Decision Tree-Model-4(Train)',ytrain,ypred4_train)

In [None]:
ypred4_prob = model4.predict_proba(xtest)
ypred4_prob[:5]

In [None]:
ypred4_prob = ypred4_prob[:,1]
ypred4_prob

In [None]:
cm = confusion_matrix(ytest, ypred4)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred4_prob)


auc_score = roc_auc_score(ytest, ypred4_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Feature Importance-DT

In [None]:
feature_imp = pd.DataFrame()
feature_imp['Features'] = xtrain.columns
feature_imp['Importance'] = dt.feature_importances_

In [None]:
feature_imp.sort_values(by='Importance',ascending=False,inplace=True)

In [None]:
plt.rcParams['figure.figsize'] = [20,10]
sns.barplot(y=feature_imp['Features'],x=feature_imp['Importance'])
plt.show()


## Model-5-Decision Tree(FI)

In [None]:
xtrain.drop(['Course_Level','Discount_Used'],axis=1,inplace=True)
xtest.drop(['Course_Level','Discount_Used'],axis=1,inplace=True)

***NOTE: The columns 'Course_Level' and 'Discount_Used' are dropped here***
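
Picking the columns to drop can also be done from the importances directly instead of reading them off the bar plot. This is a sketch on synthetic data; the feature names `f0`..`f5` are hypothetical, and dropping exactly two features is a choice that mirrors the cell above, not a rule.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 3 informative features followed by 3 pure-noise features.
X_arr, y_toy = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=42, shuffle=False)
X_toy = pd.DataFrame(X_arr, columns=[f"f{k}" for k in range(6)])

tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)
importance = pd.Series(tree.feature_importances_, index=X_toy.columns).sort_values()

# Drop the two least important features.
to_drop = importance.index[:2].tolist()
X_reduced = X_toy.drop(columns=to_drop)
print(to_drop, X_reduced.shape)
```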

In [None]:
dt = DecisionTreeClassifier(random_state=42)
model5 = dt.fit(xtrain,ytrain)
model5

In [None]:
ypred5 = model5.predict(xtest)
ypred5[:5]

In [None]:
metrics('Decision Tree(FI)-Model-5(Test)',ytest,ypred5)

In [None]:
ypred5_train = model5.predict(xtrain)
ypred5_train[:5]

In [None]:
metrics('Decision Tree(FI)-Model-5(Train)',ytrain,ypred5_train)

In [None]:
ypred5_prob = model5.predict_proba(xtest)
ypred5_prob[:5]

In [None]:
ypred5_prob = ypred5_prob[:,1]
ypred5_prob

In [None]:
cm = confusion_matrix(ytest, ypred5)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred5_prob)


auc_score = roc_auc_score(ytest, ypred5_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Model-6-Decision Tree(Tuned)

In [None]:
tuned_paramaters = [{'criterion': ['entropy','gini'],
                     'max_depth': [5,10],
                     'max_features': ["sqrt", "log2"], # sqrt or log2 of the number of features
                     'min_samples_split': [2,5,8], # minimum samples needed to split an internal node
                     'min_samples_leaf': [1,5,9], # minimum samples required at a leaf node
                     'max_leaf_nodes': [5,8]
                     }]

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
dt = DecisionTreeClassifier(random_state=42)
dt_grid = GridSearchCV(estimator=dt,param_grid=tuned_paramaters,cv=5,n_jobs=-1)
dt_grid_model6 = dt_grid.fit(xtrain,ytrain)

In [None]:
dt_grid_model6.best_params_

In [None]:
dt = DecisionTreeClassifier(criterion= 'entropy',
     max_depth= 5,
     max_features= 'sqrt',
     max_leaf_nodes= 8,
     min_samples_leaf= 1,
     min_samples_split= 2,
     random_state=42)
model6 = dt.fit(xtrain,ytrain)
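
Instead of copying the best parameters by hand, the fitted search object already holds a refitted model. The sketch below uses synthetic data and a smaller grid than the one above, both assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for xtrain / ytrain.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

param_grid = {"criterion": ["entropy", "gini"], "max_depth": [3, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_toy, y_toy)

# With the default refit=True, the best configuration is refitted on the
# full training data and exposed as best_estimator_.
best_tree = grid.best_estimator_
print(grid.best_params_)
print(best_tree.get_params()["max_depth"])
```

Using `best_estimator_` avoids the hard-coded values drifting out of sync if the grid or the data changes and the search is rerun.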

In [None]:
ypred6 = model6.predict(xtest)
ypred6[:5]

In [None]:
metrics('Decision Tree(Tuned)-Model-6(Test)',ytest,ypred6)

In [None]:
ypred6_train = model6.predict(xtrain)
ypred6_train[:5]

In [None]:
metrics('Decision Tree(Tuned)-Model-6(Train)',ytrain,ypred6_train)

In [None]:
ypred6_prob = model6.predict_proba(xtest)
ypred6_prob[:5]

In [None]:
ypred6_prob = ypred6_prob[:,1]
ypred6_prob

In [None]:
cm = confusion_matrix(ytest, ypred6)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()


In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred6_prob)


auc_score = roc_auc_score(ytest, ypred6_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Model-7-Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(random_state=42)
model7 = rfc.fit(xtrain,ytrain)
model7

In [None]:
ypred7 = model7.predict(xtest)
ypred7[:5]

In [None]:
metrics('Random Forest-Model-7(Test)',ytest,ypred7)

In [None]:
ypred7_train = model7.predict(xtrain)
ypred7_train[:5]

In [None]:
metrics('Random Forest-Model-7(Train)',ytrain,ypred7_train)

In [None]:
ypred7_prob = model7.predict_proba(xtest)
ypred7_prob[:5]

In [None]:
ypred7_prob = ypred7_prob[:,1]
ypred7_prob

In [None]:
cm = confusion_matrix(ytest, ypred7)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred7_prob)


auc_score = roc_auc_score(ytest, ypred7_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Model-8-Random Forest(Tuned)

In [None]:
rfc_params = [{'criterion': ['entropy'],
               'n_estimators': [100],
               'max_depth': [10, 12],
               'max_features': ['sqrt', 'log2'],
               'min_samples_split': [2, 4],
               'min_samples_leaf': [5, 7],
               'max_leaf_nodes': [9, 11]}]

In [None]:
rfc = RandomForestClassifier(random_state=42)

In [None]:
rfc_grid = GridSearchCV(estimator=rfc,param_grid=rfc_params,cv=5,n_jobs=-1)
rfc_grid_model8 = rfc_grid.fit(xtrain,ytrain)
rfc_grid_model8

In [None]:
rfc_grid_model8.best_params_

In [None]:
rfc = RandomForestClassifier(criterion= 'entropy',
     max_depth = 10,
     max_features='sqrt',
     max_leaf_nodes=11,
     min_samples_leaf=5,
     min_samples_split = 2,
     n_estimators = 100,
     random_state=42)
model8 = rfc.fit(xtrain,ytrain)

In [None]:
ypred8 = model8.predict(xtest)
ypred8[:5]

In [None]:
metrics('Random Forest(Tuned)-Model-8(Test)',ytest,ypred8)

In [None]:
ypred8_train = model8.predict(xtrain)
ypred8_train[:5]

In [None]:
metrics('Random Forest(Tuned)-Model-8(Train)',ytrain,ypred8_train)

In [None]:
ypred8_prob = model8.predict_proba(xtest)
ypred8_prob[:5]

In [None]:
ypred8_prob = ypred8_prob[:,1]
ypred8_prob

In [None]:
cm = confusion_matrix(ytest, ypred8)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred8_prob)


auc_score = roc_auc_score(ytest, ypred8_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Model-9-AdaBoostClassifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
abcl = AdaBoostClassifier(random_state=42)
model9 = abcl.fit(xtrain,ytrain)
model9

In [None]:
ypred9 = model9.predict(xtest)
ypred9[:5]

In [None]:
metrics('AdaBoostClassifier-Model-9(Test)',ytest,ypred9)

In [None]:
ypred9_train = model9.predict(xtrain)
ypred9_train[:5]

In [None]:
metrics('AdaBoostClassifier-Model-9(Train)',ytrain,ypred9_train)

In [None]:
ypred9_prob = model9.predict_proba(xtest)
ypred9_prob[:5]

In [None]:
ypred9_prob = ypred9_prob[:,1]
ypred9_prob

In [None]:
cm = confusion_matrix(ytest, ypred9)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred9_prob)


auc_score = roc_auc_score(ytest, ypred9_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Model-10-AdaBoostClassifier(Tuned)

In [None]:
ada_params = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1]
}

In [None]:
abcl = AdaBoostClassifier(random_state=42)

In [None]:
abcl_grid = GridSearchCV(
    estimator=abcl,
    param_grid=ada_params,
    cv=5,
    n_jobs=-1
)
abcl_grid_model10 = abcl_grid.fit(xtrain, ytrain)

In [None]:
abcl_grid_model10.best_params_

In [None]:
abcl = AdaBoostClassifier(learning_rate = 0.1, n_estimators = 200,random_state=42)
model10 = abcl.fit(xtrain,ytrain)
model10

In [None]:
ypred10 = model10.predict(xtest)
ypred10[0:5]

In [None]:
metrics('AdaBoostClassifier(Tuned)-Model-10(Test)',ytest,ypred10)

In [None]:
ypred10_train = model10.predict(xtrain)
ypred10_train[0:5]

In [None]:
metrics('AdaBoostClassifier(Tuned)-Model-10(Train)',ytrain,ypred10_train)

In [None]:
ypred10_prob = model10.predict_proba(xtest)
ypred10_prob[:5]

In [None]:
ypred10_prob = ypred10_prob[:,1]
ypred10_prob

In [None]:
cm = confusion_matrix(ytest, ypred10)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred10_prob)


auc_score = roc_auc_score(ytest, ypred10_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Model-11-XGBClassifier

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb = XGBClassifier(random_state=42)
model11 = xgb.fit(xtrain,ytrain)
model11

In [None]:
ypred11 = model11.predict(xtest)
ypred11[:5]

In [None]:
metrics('XGBClassifier-Model-11(Test)',ytest,ypred11)

In [None]:
ypred11_train = model11.predict(xtrain)
ypred11_train[:5]

In [None]:
metrics('XGBClassifier-Model-11(Train)',ytrain,ypred11_train)

In [None]:
ypred11_prob = model11.predict_proba(xtest)
ypred11_prob[:5]

In [None]:
ypred11_prob = ypred11_prob[:,1]
ypred11_prob

In [None]:
cm = confusion_matrix(ytest, ypred11)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred11_prob)


auc_score = roc_auc_score(ytest, ypred11_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## Model-12-XGBClassifier(Tuned)

In [None]:
xgb_grid_param = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

In [None]:
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False,
    random_state=42
)

In [None]:
xgb_grid = GridSearchCV(
    estimator=xgb,
    param_grid=xgb_grid_param,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2
)
xgb_grid_model12 = xgb_grid.fit(xtrain, ytrain)
xgb_grid_model12

In [None]:
xgb_grid_model12.best_params_

In [None]:
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False,
    random_state=42,
    colsample_bytree=0.8,
    learning_rate=0.1,
    max_depth=3,
    n_estimators=100,
    subsample=1.0
)
model12 = xgb.fit(xtrain,ytrain)
model12

In [None]:
ypred12 = model12.predict(xtest)
ypred12[:5]

In [None]:
metrics('XGBClassifier(Tuned)-Model-12(Test)',ytest,ypred12)

In [None]:
ypred12_train = model12.predict(xtrain)
ypred12_train[:5]

In [None]:
metrics('XGBClassifier(Tuned)-Model-12(Train)',ytrain,ypred12_train)

In [None]:
ypred12_prob = model12.predict_proba(xtest)
ypred12_prob[:5]

In [None]:
ypred12_prob = ypred12_prob[:,1]
ypred12_prob

In [None]:
cm = confusion_matrix(ytest, ypred12)

plt.rcParams['figure.figsize'] = [10,5]
conf_matrix = pd.DataFrame(data = cm, columns=['Predicted:0','Predicted:1'], index = ['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(ytest, ypred12_prob)


auc_score = roc_auc_score(ytest, ypred12_prob)

plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'r--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')

plt.show()

## PCA

In [None]:
from sklearn.decomposition import PCA

In [None]:
rb = RobustScaler()

In [None]:
# Work on a copy so that scaling does not modify X in place
x_sc = X.copy()

In [None]:
for i in df_num_list:
    x_sc[i] = rb.fit_transform(X[[i]])

In [None]:
x_sc

In [None]:
mypca=PCA(n_components=0.90)

In [None]:
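pca2 = mypca.fit_transform(x_sc)
# Name the components and keep the original row index so they stay aligned with y
pca2_df = pd.DataFrame(pca2, columns=[f'PC{k+1}' for k in range(pca2.shape[1])], index=x_sc.index)
pca2_df.head()

***Train-Test-split-PCA***

In [None]:
# Split the PCA components (not the scaled original features) so the models below use the PCA output
xtrain_pca,xtest_pca,ytrain_pca,ytest_pca = train_test_split(pca2_df,y,test_size=0.2,random_state=42)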

In [None]:
model13 = sma.Logit(ytrain_pca,xtrain_pca).fit()
print(model13.summary())

In [None]:
ypred13_prob = model13.predict(xtest_pca)
ypred13_prob[:5]

In [None]:
ypred13 = [0 if i<0.5 else 1 for i in ypred13_prob]
ypred13[:5]

In [None]:
metrics('PCA-Logit-Model-13',ytest_pca,ypred13)

***DT-14***

In [None]:
dt = DecisionTreeClassifier(random_state=42)
model14 = dt.fit(xtrain_pca,ytrain_pca)
model14

In [None]:
ypred14 = model14.predict(xtest_pca)
ypred14[:5]

In [None]:
metrics('PCA-Decision Tree-Model-14',ytest_pca,ypred14)

***NOTE: As we can clearly see, PCA is not improving the metrics for the models***
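
One way to run the PCA experiment without letting the test rows influence the scaler or the components is to chain the steps in a pipeline that is fitted on the training part only. This is a sketch under stated assumptions: the data is synthetic, and scikit-learn's `LogisticRegression` is swapped in for the statsmodels `Logit` used above so that all steps fit in one `Pipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data standing in for X and y.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", RobustScaler()),           # fitted on the training rows only
    ("pca", PCA(n_components=0.90)),     # keeps enough components for 90% of the variance
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

n_kept = pipe.named_steps["pca"].n_components_
print(n_kept, round(pipe.score(X_te, y_te), 3))
```

The test rows are only transformed with statistics learned from the training rows, so the reported score is not inflated by information from the test set.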