# Data Mining Midterm Project
### Why We Chose This Dataset:
This dataset is rich in information thought to be linked to health and nutrition. It includes extensive subsets (demographics, diet, examinations, labs, and more) that can be analyzed for correlations that contribute positively or negatively to overall health. The data can answer many questions; we focus on a few that we find most interesting.
### Business Questions:
1. Can we identify habits/factors that positively or negatively affect health?
2. Is there a correlation between income level and overall health?
3. Can we identify the main diseases that affect certain demographics?
4. Does the presence of one disease increase the likelihood of another disease in the same individual?
### Project Outline:
1. **Software Engineering** - Build functions such as `dataLoader()`, `logger()`, `featurize()`, `cluster()`, and `dimRed()` to streamline data transformations within the dataframe(s).
2. **Data Engineering** - Clean, organize, and transform the data for ease of use during the research and analysis portion of the project. This step includes featurization.
3. **Business Analysis** - Define initial challenge questions for the dataset(s), then evolve the questions as we iterate through the project.
4. **Research** - After featurization, extract metadata on the full dataset and run clustering and dimensionality reduction on various slices of the data. Metadata will be added to each cluster. Features may be added or removed from one iteration to the next as we attempt to reach conclusions on our business questions.
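
As a rough illustration of the research step in the outline above, the sketch below chains scaling, PCA, and k-means and scores the result with the silhouette coefficient. The function names `dim_red` and `cluster`, their parameter defaults, and the synthetic data are assumptions for illustration; they are not the project's actual `dimRed()` and `cluster()` implementations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def dim_red(X, n_components=2):
    """Standardize features, then project onto the top principal components."""
    X_scaled = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(X_scaled)


def cluster(X, n_clusters=3, seed=0):
    """Fit k-means and return (labels, silhouette score)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X)
    return labels, silhouette_score(X, labels)


# Synthetic stand-in for a featurized slice: three separated blobs in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])

X_2d = dim_red(X, n_components=2)
labels, score = cluster(X_2d, n_clusters=3)
print(X_2d.shape, len(set(labels)), round(score, 3))
```

Comparing the silhouette score across several values of `n_clusters` is one common way to choose the number of clusters for a given slice.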
### Resources:
- https://www.kaggle.com/code/gopalkholade/diabetes-prediction

In [176]:
import pandas as pd
import numpy as np
import re
import sklearn
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
import seaborn as sns
import plotly.express as px
from scipy.sparse.linalg import svds
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

In [177]:
demo = pd.read_csv('nhnes/demographic.csv')
diet = pd.read_csv('nhnes/diet.csv')
exam = pd.read_csv('nhnes/examination.csv')
labs = pd.read_csv('nhnes/labs.csv')
ques = pd.read_csv('nhnes/questionnaire.csv')
meds = pd.read_csv('nhnes/medications.csv', encoding='latin1')

In [178]:
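def null_handler(df):
    perc = 75.0
    # thresh is the minimum number of non-null values needed to keep a column / a row
    min_per_col = int(((100-perc)/100)*df.shape[0] + 1)
    min_per_row = int(((100-perc)/100)*df.shape[1] + 1)
    # drop sparse columns, then sparse rows of the already-filtered frame
    new_df = df.dropna(axis=1, thresh=min_per_col)
    new_df = new_df.dropna(axis=0, thresh=min_per_row)
    return new_df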
    
def get_nulls(df):
    sum_null = df.isnull().sum()
    df_shape = df.shape[0]
    null = 100 * (sum_null / df_shape)
    return pd.DataFrame({'nullPercent': null})
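
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-null values a row or column needs in order to be kept, not the number of nulls tolerated. The toy frame below (values made up for illustration) shows the per-column null percentage and the effect of `thresh` along each axis.

```python
import numpy as np
import pandas as pd

# Toy frame, not NHANES data.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
})

# Percent of missing values per column (same arithmetic as get_nulls).
null_percent = 100 * toy.isnull().sum() / toy.shape[0]
print(null_percent)

# axis=1: keep columns with at least 2 non-null values ("b" has only 1).
cols_kept = toy.dropna(axis=1, thresh=2)
print(list(cols_kept.columns))

# axis=0: keep rows with at least 2 non-null values among the remaining columns.
rows_kept = cols_kept.dropna(axis=0, thresh=2)
print(list(rows_kept.index))
```

Because the row threshold is counted against the number of columns, and the column threshold against the number of rows, the two thresholds need to be computed from opposite dimensions of the frame.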

In [190]:
df = pd.concat([labs, exam, demo, diet, ques], axis=1)
df = df.loc[:,~df.columns.duplicated()]
df

Unnamed: 0,SEQN,URXUMA,URXUMS,URXUCR.x,URXCRS,URDACT,WTSAF2YR.x,LBXAPB,LBDAPBSI,LBXSAL,...,WHD080U,WHD080L,WHD110,WHD120,WHD130,WHD140,WHQ150,WHQ030M,WHQ500,WHQ520
0,73557.0,4.3,4.3,39.0,3447.6,11.03,,,,4.1,...,,40.0,270.0,200.0,69.0,270.0,62.0,,,
1,73558.0,153.0,153.0,50.0,4420.0,306.00,,,,4.7,...,,,240.0,250.0,72.0,250.0,25.0,,,
2,73559.0,11.9,11.9,113.0,9989.2,10.53,142196.890197,57.0,0.57,3.7,...,,,180.0,190.0,70.0,228.0,35.0,,,
3,73560.0,16.0,16.0,76.0,6718.4,21.05,,,,,...,,,,,,,,3.0,3.0,3.0
4,73561.0,255.0,255.0,147.0,12994.8,173.47,142266.006548,92.0,0.92,4.3,...,,,150.0,135.0,67.0,170.0,60.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10170,,,,,,,,,,,...,,,,,,150.0,26.0,,,
10171,,,,,,,,,,,...,,,,,,,,,,
10172,,,,,,,,,,,...,,,155.0,135.0,,195.0,42.0,,,
10173,,,,,,,,,,,...,,,,,,,,,,
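
`pd.concat(..., axis=1)` aligns rows on the dataframe index rather than on the respondent identifier, so if the files differ in row count or row order, a single row can combine values from different respondents. The missing `SEQN` values in the last rows of the output above are consistent with frames of different lengths. One possible alternative, sketched here on toy frames, is to merge on `SEQN`. This assumes every table carries a `SEQN` column with one row per respondent; the toy frames, their column names other than `SEQN`, and their values are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the real tables: different lengths and row orders.
labs_toy = pd.DataFrame({"SEQN": [1, 2, 3], "lab_value": [4.1, 4.7, 3.7]})
exam_toy = pd.DataFrame({"SEQN": [3, 1], "exam_value": [70.0, 69.0]})
demo_toy = pd.DataFrame({"SEQN": [2, 3, 4], "demo_value": [1, 2, 1]})

# Outer-merge the tables pairwise on the respondent id so values stay aligned.
merged = reduce(
    lambda left, right: left.merge(right, on="SEQN", how="outer"),
    [labs_toy, exam_toy, demo_toy],
)
merged = merged.sort_values("SEQN").reset_index(drop=True)
print(merged)
```

With `how="outer"`, respondents missing from a table get NaN in that table's columns instead of being dropped. Columns that share a name across tables (other than `SEQN`) would receive merge suffixes, so duplicates would need handling before or after the merge.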


In [191]:
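# Drop columns in which 50% or more of the values are missing.
perc = 50.0
min_count = int(((100-perc)/100)*df.shape[0] + 1)
df = df.dropna(axis=1, thresh=min_count)

# Then drop rows in which 75% or more of the remaining columns are missing.
perc = 75.0
min_count = int(((100-perc)/100)*df.shape[1] + 1)
df = df.dropna(axis=0, thresh=min_count)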

In [192]:
df

Unnamed: 0,SEQN,URXUMA,URXUMS,URXUCR.x,URXCRS,URDACT,LBXSAL,LBDSALSI,LBXSAPSI,LBXSASSI,...,SMDANY,SMAQUEX.y,WHD010,WHD020,WHQ030,WHQ040,WHD050,WHQ070,WHD140,WHQ150
0,73557.0,4.3,4.3,39.0,3447.6,11.03,4.1,41.0,129.0,16.0,...,1.0,2.0,69.0,180.0,3.0,3.0,210.0,,270.0,62.0
1,73558.0,153.0,153.0,50.0,4420.0,306.00,4.7,47.0,97.0,18.0,...,1.0,2.0,71.0,200.0,3.0,3.0,160.0,2.0,250.0,25.0
2,73559.0,11.9,11.9,113.0,9989.2,10.53,3.7,37.0,99.0,22.0,...,2.0,2.0,70.0,195.0,3.0,2.0,195.0,2.0,228.0,35.0
3,73560.0,16.0,16.0,76.0,6718.4,21.05,,,,,...,,,,,,,,,,
4,73561.0,255.0,255.0,147.0,12994.8,173.47,4.3,43.0,78.0,36.0,...,2.0,2.0,67.0,120.0,2.0,1.0,150.0,2.0,170.0,60.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10166,,,,,,,,,,,...,2.0,2.0,69.0,220.0,1.0,2.0,245.0,,245.0,61.0
10167,,,,,,,,,,,...,2.0,2.0,70.0,173.0,3.0,3.0,173.0,2.0,190.0,55.0
10169,,,,,,,,,,,...,,2.0,66.0,180.0,3.0,3.0,170.0,2.0,180.0,40.0
10170,,,,,,,,,,,...,2.0,2.0,69.0,150.0,3.0,3.0,150.0,2.0,150.0,26.0


In [193]:
get_nulls(df)

Unnamed: 0,nullPercent
SEQN,2.432459
URXUMA,19.728841
URXUMS,19.728841
URXUCR.x,19.728841
URXCRS,19.728841
...,...
WHQ040,35.559765
WHD050,35.978467
WHQ070,43.435350
WHD140,39.158608
