# A Project for You



## Instructions

For this project, you'll need to build several classification models and select the best one. Your notebook will illustrate and explain this process.

### Detailed Requirements

<ol>
<li> You need one model per person. Decide on a slightly different approach for each person - you can change things like the variables used, the preparation, the model type (Regression or Bayes), and other things like that.</li>
<li> For each model, you must: </li>
    <ol>
    <li> Create a predictive model. </li>
    <li> Evaluate its performance. </li>
    <li> Create a markdown cell that explains your model choice and evaluation. </li>
    </ol>
<li> As a team, select the best model from those your members built. </li>
<li> <b> This needs to be in one workbook, containing: </b></li>
    <ol>
    <li> Initial data exploration/preparation from original data to 'ready for modeling'. Any preparation that is <b>common</b> to everyone should go here. Exact details will vary, but you want to explore the data visually as well as numerically. Focus on stuff that impacted decisions - what did you find in the data that caused you to take action or make a different choice? </li>
    <li> A markdown cell that briefly explains, ideally mostly in point form, what you found whilst exploring and what preparation you needed to do. Based on those findings, what are the different model approaches?</li>
    <li> Each person's model, evaluation, and explanation in their own section. </li>
    <li> An overall conclusion of what was best and how well it worked. Present your results in some kind of visualization. </li>
    </ol>
<li><b> In the workbook, please have the code that you settled on to explore, clean, model, and evaluate. In the explanation cells, give a brief explanation of what you did and how you got to what you settled on. </b> </li>
<li> In the data preparation, consider the statistical tools available to you. Things such as transformations may be worth a try. </li>
<li> You don't need to spend an eternity tuning the models to get amazing results; we will worry more about that later. Make N reasonable choices (one per person) on an approach to take based on what the data looks like, do the appropriate preparation, test the models, and observe the results. <b>The raw accuracy scores don't matter, only the approach.</b> </li>
</ol>
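
As a rough illustration of the per-model steps above (create a model, then evaluate it), here is a minimal sketch. It uses synthetic stand-in data rather than the course dataset, and `LogisticRegression` is only one assumed reading of "Regression"; the split size and metrics are likewise illustrative choices, not requirements.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the course dataset): two numeric features
# and a binary target that depends on them plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Hold out a test set so the evaluation uses rows the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(f"accuracy: {accuracy:.3f}")
print(matrix)
```

The confusion matrix is printed alongside accuracy because it shows which class the model gets wrong, which is usually easier to explain in a markdown cell than a single score.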

Overall, the workbook should clearly show what happened: you explored the data, applied some general processing steps, decided on different approaches based on what you found, created models according to each approach (model-specific exploration/prep, model creation, evaluation), and wrote an overall conclusion summarising what you found and what you would try next if you were to continue testing. In the data exploration especially, it is pretty normal to go through several iterations of exploring and cleaning - for example, you might add a filter for an outlier, look at a pair plot, adjust the filter, look at a pair plot, add another filter, look at a pair plot, and so on. Please don't show this literal process step by step; condense it down to what matters - something like, "here are the distributions, we applied these outlier filters and removed these two features, here's the result", then continue on. Basically, illustrate what happened and why, but not all of the back and forth or deliberation that went into it.
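
One way to condense that back and forth is a single before/after summary. The sketch below assumes a synthetic right-skewed column named `hypothetical_feature` (not a real column in the dataset), a 99th-percentile cutoff, and a `log1p` transform; all three are illustrative choices rather than the expected preparation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in feature (NOT from the course dataset): right-skewed values.
rng = np.random.default_rng(1)
df_demo = pd.DataFrame(
    {"hypothetical_feature": rng.lognormal(mean=0.0, sigma=1.0, size=1000)}
)

# Step 1: drop extreme values above the 99th percentile (assumed cutoff).
cutoff = df_demo["hypothetical_feature"].quantile(0.99)
filtered = df_demo[df_demo["hypothetical_feature"] <= cutoff].copy()

# Step 2: log-transform to reduce the remaining right skew.
filtered["log_feature"] = np.log1p(filtered["hypothetical_feature"])

# One compact summary instead of showing every intermediate iteration.
summary = pd.DataFrame(
    {
        "rows": [len(df_demo), len(filtered), len(filtered)],
        "skew": [
            df_demo["hypothetical_feature"].skew(),
            filtered["hypothetical_feature"].skew(),
            filtered["log_feature"].skew(),
        ],
    },
    index=["original", "after outlier filter", "after log1p"],
)
print(summary)
```

A table like this, plus one plot of the final distribution, tells the reader what changed and why without replaying each attempt.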

<b>Note:</b> I said that you can try Bayes as well; it isn't mandatory, but feel free. If you do, you'll likely need to do a little reading on exactly what to use: there are a few variations of Bayes rather than a single 'Bayes' model, and you'll need to select one. The documentation has a pretty good explainer (it is generally linked from the 'user guide' link on each model's documentation page, or google "sklearn naive bayes"). Things work more or less the same as before; you just need to meet the requirements of the specific model you choose.
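
To make "meet the requirements of the specific model" concrete, the sketch below contrasts two of the scikit-learn variants, `GaussianNB` and `CategoricalNB`. The data is synthetic (not the course dataset), and picking these two variants is an assumption for illustration; the user guide describes others.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic stand-in data (NOT the course dataset).
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=300)

# GaussianNB assumes continuous features, roughly normal within each class.
X_numeric = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(300, 2))
gaussian_model = GaussianNB().fit(X_numeric, y)

# CategoricalNB expects every feature to be integer-encoded categories (0, 1, 2, ...).
X_categorical = np.column_stack(
    [
        (y + rng.integers(0, 2, size=300)) % 3,  # category loosely tied to the class
        rng.integers(0, 4, size=300),            # category unrelated to the class
    ]
)
categorical_model = CategoricalNB().fit(X_categorical, y)

# Training accuracy is shown only to confirm both models fit;
# a real evaluation should use held-out data.
gaussian_score = gaussian_model.score(X_numeric, y)
categorical_score = categorical_model.score(X_categorical, y)
print("GaussianNB training accuracy:   ", gaussian_score)
print("CategoricalNB training accuracy:", categorical_score)
```

The point is the input each variant expects: feeding raw category codes to `GaussianNB`, or continuous values to `CategoricalNB`, violates the model's assumptions even if the code runs.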

### Submission and Marks
<ul>
<li> Check your work into the repository. I'm reading one file, so please make sure it is condensed. </li>
<li> Grade distribution: </li>
    <ul>
    <li> Your model - from the end of the common preparation to your evaluation and explanation. Was it done correctly? 25% </li>
    <li> Exploration and common preparation - code was clear, exploration was readable, preparation was done correctly, and explanation made sense. 25% </li>
    <li> Team choice on model approach - based on what you found in exploration, you made reasonable choices on the different model approaches. 10% </li>
    <li> Overall conclusion - reasonable choice on best model, <b>thoughts on what you may try next if you were to keep testing</b>, and results presented clearly. 20% </li>
    <li> "Can I read and understand it?" - overall, you're explaining what you found; can I read it and understand what you did? 20% </li>
    </ul>
</ul>

#### Data Dictionary

📘 Overview

This dataset provides detailed information about students enrolled in various online courses. It includes demographic, behavioral, and performance features to predict whether a learner will complete the course or drop out.

📊 Key Details
<ul>
<li>Rows: 100,000</li>
<li>Target Variable: Completed</li>
</ul>

🧠 Feature Categories
<ul>
<li>Demographic: Gender, Age, Education_Level, Employment_Status</li>
<li>Course Info: Course_Level, Duration, Instructor_Rating</li>
<li>Engagement: Login_Frequency, Video_Completion_Rate, Discussion_Participation</li>
<li>Performance: Assignments_Submitted, Quiz_Score_Avg, Project_Grade, Progress_Percentage</li>
<li>Other: Payment_Mode, Discount_Used, App_Usage_Percentage</li>
</ul>

<b>Note:</b> The feature set is mixed, meaning there are both categorical and numerical variables present in the dataset. <b>Mixed features generally require different processing. We haven't yet covered the tool that combines these easily; that's fine, we have stats instead. Right now, there are several decisions you can make to use the numerical and categorical data together. Think about how to represent the data and about the options we have available for transformations.</b> This is part of what you want to figure out while exploring the data: you have a constraint on your ability to use mixed feature sets, yet you must use the data you <i>do</i> have, along with the different processing choices you can make, to create a model. Again, you are testing a few approaches against each other; the idea isn't to pick the best approach off the top of your head, it's to make reasonable attempts to narrow down what works best. This isn't a trick with one specific answer; there are a bunch of things you can do that are reasonable. Later, we'll automate some of this so we can try more things and work towards the 'best' model.
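
As one hedged illustration of "how to represent the data", the sketch below turns two categorical columns into numbers with plain pandas. The rows are made up for the example (not read from the dataset), the ordering assigned to `Education_Level` is an assumption that would need checking against the real values, and these are only two of several reasonable options.

```python
import pandas as pd

# Illustrative rows only (NOT read from the dataset); column names follow the data dictionary.
demo = pd.DataFrame(
    {
        "Education_Level": ["Bachelor", "Master", "PhD", "Bachelor"],
        "Payment_Mode": ["UPI", "Free", "Credit Card", "UPI"],
        "Progress_Percentage": [50.1, 74.6, 54.7, 41.4],
    }
)

# Option 1: ordered categories -> integer ranks.
# ASSUMPTION: this ordering and set of levels is hypothetical; check the real values first.
education_order = {"Bachelor": 0, "Master": 1, "PhD": 2}
demo["Education_Level_Rank"] = demo["Education_Level"].map(education_order)

# Option 2: unordered categories -> one 0/1 indicator column per category.
payment_dummies = pd.get_dummies(demo["Payment_Mode"], prefix="Payment_Mode", dtype=int)

# Combine everything into a single all-numeric feature table.
features = pd.concat(
    [demo[["Progress_Percentage", "Education_Level_Rank"]], payment_dummies], axis=1
)
print(features)
```

Which representation suits a column depends on whether its categories have a natural order, and that is exactly the kind of decision worth explaining in the exploration section.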

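```python
import pandas as pd

# Load the prepared dataset and name the target column once for reuse.
df = pd.read_csv("cleaned_student_data.csv")
TARGET_COL = "Completed"

# Summarise the target (count, number of classes, most frequent class).
print(df[TARGET_COL].describe())

# Preview ten random rows.
df.sample(10)
```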

```
count            100000
unique                2
top       Not Completed
freq              50970
Name: Completed, dtype: object
```


| Unnamed: 0 | Student_ID | Name | Gender | Education_Level | Employment_Status | City | Device_Type | Internet_Connection_Quality | Course_ID | Course_Name | ... | Assignments_Submitted | Quiz_Attempts | Progress_Percentage | Enrollment_Date | Payment_Mode | Fee_Paid | Discount_Used | App_Usage_Percentage | Satisfaction_Rating | Completed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9982 | STU109982 | Vihaan Verma | Female | Master | Student | Mumbai | Mobile | Low | C104 | Digital Marketing Essentials | ... | 5 | 3 | 50.1 | 25-04-2025 | Scholarship | No | No | 94 | 4.0 | Completed |
| 4677 | STU104677 | Aarav Gupta | Male | Bachelor | Employed | Pune | Laptop | High | C104 | Digital Marketing Essentials | ... | 6 | 0 | 60.3 | 10-05-2025 | UPI | Yes | No | 60 | 5.0 | Not Completed |
| 65727 | STU165727 | Meera Bhardwaj | Female | Bachelor | Employed | Hyderabad | Mobile | Medium | C106 | Machine Learning A-Z | ... | 5 | 5 | 48.0 | 15-05-2025 | UPI | Yes | Yes | 95 | 4.0 | Not Completed |
| 88331 | STU188331 | Meera Patel | Male | Bachelor | Unemployed | Bhopal | Laptop | Medium | C107 | Statistics for Data Science | ... | 3 | 6 | 37.5 | 28-04-2024 | Scholarship | No | No | 82 | 4.0 | Not Completed |
| 31238 | STU131238 | Ananya Bhardwaj | Male | Bachelor | Student | Bengaluru | Mobile | High | C105 | UI/UX Design Fundamentals | ... | 8 | 7 | 74.6 | 01-06-2024 | Free | No | No | 89 | 5.0 | Completed |
| 83504 | STU183504 | Kavya Iyer | Male | Bachelor | Student | Bhopal | Tablet | Medium | C106 | Machine Learning A-Z | ... | 8 | 6 | 79.8 | 12-02-2025 | Credit Card | Yes | No | 56 | 5.0 | Completed |
| 77640 | STU177640 | Kavya Reddy | Other | Bachelor | Employed | Surat | Tablet | High | C101 | Python Basics | ... | 3 | 2 | 41.4 | 02-03-2025 | UPI | Yes | Yes | 92 | 4.0 | Completed |
| 9157 | STU109157 | Arjun Shah | Female | PhD | Employed | Bengaluru | Mobile | Medium | C108 | Excel for Business | ... | 4 | 4 | 54.7 | 29-07-2024 | Credit Card | Yes | No | 43 | 4.0 | Not Completed |
| 54271 | STU154271 | Rohan Patel | Male | Bachelor | Student | Ahmedabad | Mobile | High | C108 | Excel for Business | ... | 6 | 3 | 68.2 | 08-05-2024 | Free | No | No | 28 | 4.0 | Not Completed |
| 39110 | STU139110 | Sakshi Gupta | Male | Bachelor | Employed | Hyderabad | Laptop | High | C106 | Machine Learning A-Z | ... | 6 | 5 | 52.7 | 19-08-2025 | Debit Card | Yes | No | 89 | 4.0 | Not Completed |
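
The `describe()` output above already gives a reference point for reading model results. As a small worked example, using only the counts printed there (100,000 rows, two classes, 50,970 in the most frequent class), the class shares work out as follows:

```python
# Counts taken from the describe() output above: 100,000 rows, 2 classes,
# most frequent class "Not Completed" with 50,970 rows.
total_rows = 100_000
majority_count = 50_970

majority_share = majority_count / total_rows
minority_share = 1 - majority_share

# A classifier that always predicts the majority class would score this accuracy,
# so it is a useful floor when reading each model's results.
print(f"Not Completed: {majority_share:.2%}")
print(f"Completed:     {minority_share:.2%}")
```

The classes are close to balanced, so a model scoring near 51% accuracy is doing little better than always guessing "Not Completed".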
