# Introduction to data analysis - Spring 2023
## Mini-Project

Lior Ben Sidi
Yarin Katan

### Submission guidelines:
● Submission deadline: 08/08/23 at 23:55.
○ The submission box in Moodle will close 48 hours after this deadline. To avoid
penalties for late submissions (as stated in the syllabus), submit by this deadline.
● Submission in pairs only (unless special permission is given by the head TA)
● Submission must include (at least) two files (not a single zip):
○ One jupyter notebook with your answers to parts 1, 2, 3, and 4 with both markdown
and code cells. Markdown cells should contain brief explanations of your analyses.
No need to elaborate in this file - you will do that in the PDF - but it should be clear
enough that we know which questions the code relates to and what it does. Code
cells must enable complete reproduction of all your results. Your code should be
clearly documented.
○ One PDF of exported content of your jupyter notebook. The PDF has to contain your
outputs, similar to homework submission.
○ One PDF file with your answers to part 5.
○ You should merge the exported PDF and the PDF for part 5. You may use online free
tools to do that, like: https://tools.pdf24.org/en/merge-pdf
○ As listed in the syllabus, if you use generative AI tools, you must also submit a (third)
docx file with details on your usage. Refer to the syllabus for details.
● File names must be in the following format: final_project_ID1_ID2.pdf,
final_project_ID1_ID2.ipynb.
● You must adhere to the syllabus when you prepare and submit your work. Any deviation from
the syllabus and/or guidelines herein will result in point deduction.
● You can write in either Hebrew or English, but it’s better to use the language you are more
comfortable with.

### Instructions
For the course mini-project, you will work with a dataset of your choice (from a set of possible
datasets) to answer questions you are curious about.

### IMPORTANT NOTES, please read carefully:
● In all of the following analyses, you will likely need to make some choices regarding what
variables to include, whether to do some pre-processing (e.g., addressing missing values,
generating new variables), what techniques to use in the analysis, etc. Clearly state each
decision you made, explain why you made it and what might have been alternative choices.
● You can earn up to 15 bonus points for your project (can reach a maximum of 115 points) if
you do a particularly thoughtful analysis, involving either an additional (unprovided, but
somewhat complementary) dataset, or analysis of a complex data type (e.g., text). Note,
simply using complex data or an additional data source does not guarantee a bonus. If you
ask an interesting question and think of an original way to address it, that will get you the
extra points.

### Part 1: Choose a dataset

#### Choose one of the following datasets:
● Adult Income dataset:
https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
● US Estimated Crimes dataset:
https://www.kaggle.com/datasets/tunguz/us-estimated-crimes
● Diabetes Prediction dataset:
https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
● US Married Couples in 1976 dataset:
https://www.kaggle.com/datasets/utkarshx27/labor-supply-data
● World Air Quality Index dataset:
https://www.kaggle.com/datasets/adityaramachandran27/world-air-quality-index-by-city-and-coordinates
● Mobile Phones dataset:
https://www.kaggle.com/datasets/artempozdniakov/ukrainian-market-mobile-phones-data
● Students Exam Scores dataset (Only use the version in the file named
‘Expanded_data_with_more_features.csv’, not the version in the other file):
https://www.kaggle.com/datasets/desalegngeb/students-exam-scores?select=Expanded_data_with_more_features.csv

Once you have chosen your dataset:
1) State which dataset you chose.
2) Provide a brief (2-4 sentences) description of the dataset. What is this dataset about?
3) List the features in the dataset and their types.
4) List the number of records in the dataset.

In [84]:
#Import libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('tableau-colorblind10')
import seaborn as sns
sns.set_palette("viridis")
from tqdm import tqdm

In [85]:
df_diabetes = pd.read_csv('diabetes_prediction_dataset.csv')
df_diabetes

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0
...,...,...,...,...,...,...,...,...,...
99995,Female,80.0,0,0,No Info,27.32,6.2,90,0
99996,Female,2.0,0,0,No Info,17.37,6.5,100,0
99997,Male,66.0,0,0,former,27.83,5.7,155,0
99998,Female,24.0,0,0,never,35.42,4.0,100,0


#### Question 1

We chose the "Diabetes prediction dataset".

#### Question 2

The Diabetes prediction dataset contains medical and demographic records of patients,
each labeled with whether or not the patient has diabetes.
The recorded factors are age, gender, BMI, hypertension, heart disease,
smoking history, HbA1c level, and blood glucose level.

#### Question 3

The features in the dataset and their types, as read from the DataFrame, are:
1. gender - Categorical('Female', 'Male', 'Other')
2. age - Numeric continuous(float)
3. hypertension - Boolean(binary)
4. heart_disease - Boolean(binary)
5. smoking_history - Categorical
6. bmi - Numeric continuous(float)
7. HbA1c_level - Numeric continuous(float)
8. blood_glucose_level - Numeric discrete (integer)
9. diabetes - Boolean(binary)
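
The types listed above can be cross-checked against what pandas actually stores. The following is a minimal sketch, not part of the original notebook: `sample`, `feature_summary`, and `binary_columns` are hypothetical names, and `sample` is a toy frame built from the first five rows displayed in Part 1. In the notebook itself the same calls would be applied to `df_diabetes`.

```python
import pandas as pd

# Toy frame (assumption): the first five rows shown in the Part 1 output.
# In the real notebook, use df_diabetes instead of this sample.
sample = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Female", "Male"],
    "age": [80.0, 54.0, 28.0, 36.0, 76.0],
    "hypertension": [0, 0, 0, 0, 1],
    "heart_disease": [1, 0, 0, 0, 1],
    "smoking_history": ["never", "No Info", "never", "current", "current"],
    "bmi": [25.19, 27.32, 27.32, 23.45, 20.14],
    "HbA1c_level": [6.6, 6.6, 5.7, 5.0, 4.8],
    "blood_glucose_level": [140, 80, 158, 155, 155],
    "diabetes": [0, 0, 0, 0, 0],
})

# Storage type of each column and the number of distinct values it takes.
feature_summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
})
print(feature_summary)

# Columns whose values are all 0 or 1 are candidates for binary indicators.
binary_columns = [
    col for col in sample.columns
    if set(sample[col].unique()) <= {0, 1}
]
print(binary_columns)
```

A column stored as an integer is not necessarily discrete in meaning, so the dtype is only a starting point for the classification; the final call on each feature type still depends on what the variable measures.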

#### Question 4

In [86]:
print(f'The number of records in the dataset is: {df_diabetes.shape[0]}')

The number of records in the dataset is: 100000


### Part 2: Exploratory data analysis

In this part, you will do an initial exploration of the dataset you chose.
This part should serve the next parts.
That is, you should look at variables that can influence your analyses for parts (3) and (4).
Of course, you can (and probably should) also explore further, and/or use this as a way to motivate questions for parts (3) and (4).
You should explain why you are exploring the particular variables you
chose.

1. Show plots illustrating the distribution of at least 5 variables in your dataset.
Comment on anything interesting you observe.
2. Show plots illustrating bivariate relationships for at least 2 pairs of variables.
Explain what you observe (e.g., positive/negative correlation, no correlation, etc.).
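
As a numeric complement to the plots requested above, the sketch below summarises one numeric variable, one categorical variable, and one pair of variables. It is illustrative only: `sample` is a toy frame built from the nine rows displayed in Part 1, and `age_summary`, `smoking_share`, and `age_bmi_corr` are hypothetical names. Nine rows are far too few to draw conclusions from, so the same calls would need to be run on the full `df_diabetes`.

```python
import pandas as pd

# Toy frame (assumption): the nine rows visible in the Part 1 output.
sample = pd.DataFrame({
    "age": [80.0, 54.0, 28.0, 36.0, 76.0, 80.0, 2.0, 66.0, 24.0],
    "bmi": [25.19, 27.32, 27.32, 23.45, 20.14, 27.32, 17.37, 27.83, 35.42],
    "smoking_history": ["never", "No Info", "never", "current", "current",
                        "No Info", "No Info", "former", "never"],
})

# Univariate, numeric: count, mean, standard deviation, and quartiles.
age_summary = sample["age"].describe()
print(age_summary)

# Univariate, categorical: share of records in each category.
smoking_share = sample["smoking_history"].value_counts(normalize=True)
print(smoking_share)

# Bivariate: Pearson correlation between two numeric variables.
# The sign gives the direction of the linear relationship, values near 0
# indicate no linear relationship.
age_bmi_corr = sample["age"].corr(sample["bmi"])
print(f"Pearson correlation between age and bmi: {age_bmi_corr:.3f}")
```

Pearson correlation only captures linear association, which is one reason the plots remain necessary: a scatter plot can reveal a nonlinear pattern that the coefficient misses.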

In [87]:
# Drop the 18 records of unknown gender ('Other'). The .copy() makes the result
# an independent DataFrame, which avoids pandas' SettingWithCopyWarning below.
df_diabetes = df_diabetes[df_diabetes.gender != 'Other'].copy()
# Replace the unknown values ('No Info') of 'smoking_history' with NaN
df_diabetes = df_diabetes.replace('No Info', np.nan)
df_diabetes


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0
...,...,...,...,...,...,...,...,...,...
99995,Female,80.0,0,0,,27.32,6.2,90,0
99996,Female,2.0,0,0,,17.37,6.5,100,0
99997,Male,66.0,0,0,former,27.83,5.7,155,0
99998,Female,24.0,0,0,never,35.42,4.0,100,0
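
After turning 'No Info' into missing values, it is worth checking how many values became missing, since that affects any later analysis involving `smoking_history`. The sketch below is one way to do this under stated assumptions: `sample` is a toy frame with the nine rows displayed above plus one hypothetical 'Other' row added only to exercise the filter, and `cleaned`, `missing_counts`, and `missing_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): the nine displayed rows, plus one made-up
# 'Other' row at the end so that the gender filter has something to drop.
sample = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Female", "Male",
               "Female", "Female", "Male", "Female", "Other"],
    "smoking_history": ["never", "No Info", "never", "current", "current",
                        "No Info", "No Info", "former", "never", "never"],
})

# Same cleaning steps as in the cell above, applied to the toy frame.
cleaned = sample[sample["gender"] != "Other"].copy()
cleaned = cleaned.replace("No Info", np.nan)

# Number of missing values per column after the replacement.
missing_counts = cleaned.isna().sum()
print(missing_counts)

# Fraction of records with no smoking information.
missing_share = cleaned["smoking_history"].isna().mean()
print(f"Share of records missing smoking_history: {missing_share:.1%}")
```

If the missing share turns out to be large on the full dataset, dropping those rows would discard much of the data, so keeping them and excluding them only from analyses that use `smoking_history` may be the safer choice.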
