<h1><center>2021 BRFSS Data: Part 1</center></h1>


#### About the data:
- The dataset used in this project is the 2021 BRFSS Data.
- The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all the states in the United States and participating US territories and the Centers for Disease Control and Prevention (CDC).
- It is used to collect prevalence data among adult U.S. residents regarding their risk behaviors and preventive health practices that can affect their health status. Respondent data are forwarded to CDC to be aggregated for each state, returned with standard tabulations, and published at year's end by each state. 
- To get the database used, __[Click Here](https://www.cdc.gov/brfss/annual_data/annual_2021.html)__
- For the codebook of the database, __[Click Here](https://www.cdc.gov/brfss/annual_data/2021/pdf/codebook21_llcp-v2-508.pdf)__


#### The project has three parts:
- __Part 1. Preparation:__ In this part, I will choose which columns to keep for analysis and change their values to a more human-readable form according to the BRFSS codebook.<br>
- __Part 2. EDA:__ In this part, I will do some exploratory data analysis to find insights in the data.<br>
- __Part 3. Statistics:__ In this part, I will run statistical hypothesis tests to evaluate some of the insights found in Part 2, and I will also conduct a regression analysis.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

In [2]:
df_raw = pd.read_sas('LLCP2021.XPT')

In [3]:
df_raw.sample(5)

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,PVTRESD1,COLGHOUS,STATERE1,CELPHON1,LADULT1,COLGSEX,NUMADULT,LANDSEX,NUMMEN,NUMWOMEN,RESPSLCT,SAFETIME,CTELNUM1,CELLFON5,CADULT1,CELLSEX,PVTRESD3,CCLGHOUS,CSTATE1,LANDLINE,HHADULT,SEXVAR,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,PRIMINSR,PERSDOC3,MEDCOST1,CHECKUP1,EXERANY2,BPHIGH6,BPMEDS,CHOLCHK3,TOLDHI3,CHOLMED3,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNCR,CHCOCNCR,CHCCOPD3,ADDEPEV3,CHCKDNY2,DIABETE4,DIABAGE3,HAVARTH5,ARTHEXER,ARTHEDU,LMTJOIN3,ARTHDIS2,JOINPAI2,MARITAL,EDUCA,RENTHOM1,NUMHHOL3,NUMPHON3,CPDEMO1B,VETERAN3,EMPLOY1,CHILDREN,INCOME3,PREGNANT,WEIGHT2,HEIGHT3,DEAF,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,SMOKE100,SMOKDAY2,USENOW3,ECIGNOW1,ALCDAY5,AVEDRNK3,DRNK3GE5,MAXDRNKS,FLUSHOT7,FLSHTMY3,IMFVPLA2,PNEUVAC4,HIVTST7,HIVTSTD3,FRUIT2,FRUITJU2,FVGREEN1,FRENCHF1,POTATOE1,VEGETAB2,PDIABTST,PREDIAB1,INSULIN1,BLDSUGAR,FEETCHK3,DOCTDIAB,CHKHEMO3,FEETCHK,EYEEXAM1,DIABEYE,DIABEDU,TOLDCFS,HAVECFS,WORKCFS,TOLDHEPC,TRETHEPC,PRIRHEPC,HAVEHEPC,HAVEHEPB,MEDSHEPB,HPVADVC4,HPVADSHT,TETANUS1,SHINGLE2,LCSFIRST,LCSLAST,LCSNUMCG,LCSCTSCN,HADMAM,HOWLONG,CERVSCRN,CRVCLCNC,CRVCLPAP,CRVCLHPV,HADHYST2,PSATEST1,PSATIME1,PCPSARS2,PCSTALK,HADSIGM4,COLNSIGM,COLNTES1,SIGMTES1,LASTSIG4,COLNCNCR,VIRCOLO1,VCLNTES1,SMALSTOL,STOLTEST,STOOLDN1,BLDSTFIT,SDNATES1,CNCRDIFF,CNCRAGE,CNCRTYP1,CSRVTRT3,CSRVDOC1,CSRVSUM,CSRVRTRN,CSRVINST,CSRVINSR,CSRVDEIN,CSRVCLIN,CSRVPAIN,CSRVCTL2,HOMBPCHK,HOMRGCHK,WHEREBP,SHAREBP,WTCHSALT,DRADVISE,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,CAREGIV1,CRGVREL4,CRGVLNG1,CRGVHRS1,CRGVPRB3,CRGVALZD,CRGVPER1,CRGVHOU1,CRGVEXPT,ACEDEPRS,ACEDRINK,ACEDRUGS,ACEPRISN,ACEDIVRC,ACEPUNCH,ACEHURT1,ACESWEAR,ACETOUCH,ACETTHEM,ACEHVSEX,ACEADSAF,ACEADNED,MARIJAN1,USEMRJN3,RSNMRJN2,LASTSMK2,STOPSMK2,FIREARM5,GUNLOAD,LOADULK2,RCSGENDR,RCSRLTN2,CASTHDX2,CASTHNO2,BIRTHSEX,SOMALE,SOFEMALE,TRNSGNDR,QSTVER,QSTLANG,_METSTAT,_URBSTAT,MSCODE,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_IMPRACE,_CHISPNC,_CRACE1,_CPRACE1,CAGEG
,_CLLCPWT,_DUALUSE,_DUALCOR,_LLCPWT2,_LLCPWT,_RFHLTH,_PHYS14D,_MENT14D,_HLTHPLN,_HCVU652,_TOTINDA,_RFHYPE6,_CHOLCH3,_RFCHOL3,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR3,_LMTACT3,_LMTWRK3,_PRACE1,_MRACE1,_HISPANC,_RACE,_RACEG21,_RACEGR3,_RACEPRV,_SEX,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG1,_SMOKER3,_RFSMOK3,_CURECI1,DRNKANY5,DROCDY3_,_RFBING5,_DRNKWK1,_RFDRHV7,_FLSHOT7,_PNEUMO3,_AIDTST4,FTJUDA2_,FRUTDA2_,GRENDA1_,FRNCHDA_,POTADA1_,VEGEDA2_,_MISFRT1,_MISVEG1,_FRTRES1,_VEGRES1,_FRUTSU1,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1
342962,45.0,7.0,b'08092021',b'08',b'09',b'2021',1200.0,b'2021002064',2021002000.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,1.0,,,,,,,,,,,1.0,9.0,30.0,88.0,88.0,7.0,3.0,2.0,1.0,1.0,3.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,2.0,1.0,2.0,2.0,2.0,3.0,,2.0,,,,,,1.0,6.0,1.0,1.0,1.0,1.0,1.0,7.0,88.0,99.0,,160.0,509.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,,,,,,,,,,,,,,,,,,,7.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,3.0,451051.0,7.202447,2.0,14.404893,1.0,,,,,,1.0,0.516469,216.011567,87.710255,9.0,3.0,1.0,1.0,9.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,3.0,3.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,12.0,2.0,77.0,6.0,69.0,175.0,7257.0,2363.0,2.0,1.0,1.0,4.0,9.0,9.0,9.0,9.0,9.0,900.0,9.0,99900.0,9.0,9.0,9.0,,,,,,,,2.0,4.0,5.397605e-79,5.397605e-79,,,9.0,9.0,1.0,1.0,1.0,1.0
433024,66.0,8.0,b'11132021',b'11',b'13',b'2021',1100.0,b'2021001386',2021001000.0,,,,,,,,,,,,,1.0,1.0,1.0,1.0,2.0,1.0,,1.0,2.0,3.0,2.0,4.0,88.0,88.0,,7.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,2.0,1.0,2.0,2.0,2.0,1.0,38.0,1.0,1.0,2.0,1.0,1.0,7.0,3.0,4.0,1.0,,,1.0,2.0,7.0,88.0,7.0,,190.0,411.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,,3.0,3.0,888.0,,,,1.0,102021.0,6.0,1.0,2.0,,101.0,555.0,304.0,301.0,555.0,102.0,,,2.0,201.0,101.0,2.0,3.0,1.0,2.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,88.0,,,,,,,,,,,,,,,,20.0,1.0,,,,662011.0,13.775053,1.0,13.775053,6.0,,,,,,9.0,,51.055326,15.547618,2.0,1.0,1.0,1.0,9.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,5.0,5.0,2.0,5.0,2.0,3.0,5.0,2.0,10.0,2.0,68.0,6.0,59.0,150.0,8618.0,3837.0,4.0,2.0,1.0,2.0,5.0,4.0,1.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,1.0,1.0,2.0,5.397605e-79,100.0,13.0,3.0,5.397605e-79,200.0,5.397605e-79,5.397605e-79,1.0,1.0,100.0,216.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79
141923,23.0,6.0,b'06052021',b'06',b'05',b'2021',1100.0,b'2021009715',2021010000.0,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,2.0,1.0,1.0,3.0,88.0,88.0,,7.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,1.0,60.0,1.0,1.0,2.0,1.0,1.0,5.0,2.0,5.0,1.0,,,1.0,1.0,2.0,88.0,8.0,,300.0,508.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,3.0,3.0,101.0,1.0,88.0,2.0,1.0,122020.0,3.0,2.0,1.0,777777.0,102.0,555.0,101.0,301.0,555.0,202.0,,,2.0,101.0,101.0,1.0,1.0,1.0,3.0,2.0,2.0,,,,,,,,,,,,,,16.0,61.0,20.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,3.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,88.0,,,,1.0,,,,,,,,,,,,22.0,1.0,2.0,2.0,,232111.0,1.252849,1.0,1.252849,1.0,9.0,,,,,9.0,,23.918364,20.26284,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,9.0,1.0,61.0,5.0,68.0,173.0,13608.0,4561.0,4.0,2.0,1.0,3.0,5.0,1.0,2.0,1.0,1.0,14.0,1.0,100.0,1.0,,,1.0,5.397605e-79,200.0,100.0,3.0,5.397605e-79,29.0,5.397605e-79,5.397605e-79,1.0,1.0,200.0,132.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79
115101,20.0,1.0,b'01162021',b'01',b'16',b'2021',1100.0,b'2021010873',2021011000.0,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,2.0,2.0,1.0,1.0,88.0,88.0,,1.0,1.0,2.0,1.0,1.0,3.0,,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,,2.0,,,,,,1.0,5.0,1.0,,,3.0,2.0,1.0,2.0,10.0,,145.0,506.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,1.0,3.0,202.0,1.0,88.0,2.0,1.0,82020.0,1.0,2.0,2.0,,201.0,555.0,202.0,202.0,202.0,555.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,2.0,,4.0,22.0,1.0,1.0,1.0,,202031.0,14.786075,1.0,14.786075,1.0,2.0,1.0,1.0,1.0,140.385004,9.0,,114.502124,142.185529,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,3.0,3.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,39.0,3.0,66.0,168.0,6577.0,2340.0,2.0,1.0,3.0,3.0,6.0,4.0,1.0,1.0,1.0,7.0,1.0,47.0,1.0,,,2.0,5.397605e-79,14.0,29.0,29.0,29.0,5.397605e-79,5.397605e-79,5.397605e-79,1.0,1.0,14.0,87.0,2.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79
410556,53.0,7.0,b'07212021',b'07',b'21',b'2021',1100.0,b'2021009323',2021009000.0,,,,,,,,,,,,,1.0,1.0,1.0,1.0,2.0,1.0,,1.0,2.0,2.0,2.0,3.0,15.0,15.0,15.0,1.0,2.0,2.0,1.0,1.0,3.0,,2.0,1.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,3.0,,2.0,,,,,,1.0,6.0,1.0,,,1.0,2.0,3.0,88.0,11.0,2.0,192.0,507.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,3.0,3.0,202.0,1.0,88.0,2.0,2.0,,,2.0,2.0,,330.0,305.0,320.0,301.0,555.0,315.0,2.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,4.0,20.0,1.0,1.0,1.0,,532021.0,77.300129,1.0,77.300129,1.0,,,,,,9.0,,606.266602,339.20351,1.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,3.0,2.0,3.0,3.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,6.0,1.0,49.0,4.0,67.0,170.0,8709.0,3007.0,4.0,2.0,1.0,4.0,7.0,4.0,1.0,1.0,1.0,7.0,1.0,47.0,1.0,,,2.0,17.0,100.0,67.0,3.0,5.397605e-79,50.0,5.397605e-79,5.397605e-79,1.0,1.0,117.0,120.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79
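
A note on the sample above: the value `5.397605e-79` shows up wherever a zero would be expected (for example `DROCDY3_` for a respondent who reported no drinking). I assume this is how zeros came through the XPT conversion, not a real measurement. A minimal sketch of one way to clean it, on a toy frame; the helper name `zero_out_tiny` and the cutoff `TINY` are hypothetical, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical cutoff: anything this close to zero is assumed to be a zero artifact.
TINY = 1e-70

def zero_out_tiny(frame, threshold=TINY):
    """Return a copy where numeric values with magnitude below `threshold` are set to 0."""
    out = frame.copy()
    num_cols = out.select_dtypes(include='number').columns
    # NaN compares as False, so missing values are left untouched.
    out[num_cols] = out[num_cols].mask(out[num_cols].abs() < threshold, 0.0)
    return out

# Toy data that imitates two columns of the raw sample.
toy = pd.DataFrame({
    'DROCDY3_': [5.397605e-79, 14.0, np.nan],
    'IDATE': [b'08092021', b'11132021', b'06052021'],
})
cleaned = zero_out_tiny(toy)
print(cleaned)
```

None of the columns kept in this part show the artifact in the sample, so this step is optional here.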


In [4]:
# Choose which columns to keep and rename them to more understandable names.

Columns_To_Keep = ['_URBSTAT', 'SEXVAR', '_AGE_G', '_EDUCAG', 'MARITAL', '_INCOMG1', 'VETERAN3', 'HTM4', 'WTKG3', '_BMI5', '_BMI5CAT', '_SMOKER3', 'AVEDRNK3', '_HLTHPLN', '_RFHYPE6', 'CVDINFR4', 'CVDCRHD4', 'CVDSTRK3', 'DIABETE4', 'DIABAGE3', 'BLDSUGAR', 'CHKHEMO3', 'INSULIN1', 'DIABEYE', 'FEETCHK', '_LTASTH1', 'CHCCOPD3', '_DRDXAR3', 'CHCKDNY2', 'ADDEPEV3']
Columns_New_Name = ['Urban_Rural', 'Gender', 'Age_Category', 'Education', 'Martial_Status', 'Income_Category', 'Veteran', 'Height', 'Weight', 'BMI', 'BMI_Category', 'Smoking' ,'Alcohol_Drinks', 'Insurance', 'HTN', 'MI', 'CHD', 'Stroke', 'DM', 'DM_Age', 'BGM_Weekly', 'A1C', 'Insulin', 'Retinopathy_Counseled', 'Feet_Check', 'Asthma', 'COPD', 'Arthritis', 'Kidney_Disease', 'Depression']

df = df_raw[Columns_To_Keep].copy()
df.columns = Columns_New_Name
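
The two parallel lists above have to stay in the same order for the renaming to be correct. A possible alternative, shown here as a sketch on toy data, is to keep each old/new pair together in one dictionary; `RENAME_MAP` and `toy_raw` are hypothetical names used only for illustration:

```python
import pandas as pd

# Each BRFSS column name sits next to its readable name, so the pairs cannot drift apart.
RENAME_MAP = {
    'SEXVAR': 'Gender',
    'HTM4': 'Height',
    'WTKG3': 'Weight',
}

# Toy stand-in for df_raw with one column that should be dropped.
toy_raw = pd.DataFrame({
    'SEXVAR': [1.0, 2.0],
    'HTM4': [175.0, 150.0],
    'WTKG3': [7257.0, 8618.0],
    'FMONTH': [7.0, 8.0],
})

# Select only the mapped columns, then rename them in one step.
toy = toy_raw[list(RENAME_MAP)].rename(columns=RENAME_MAP)
print(toy.columns.tolist())
```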

In [5]:
# Change the values in the database according to the BRFSS codebook.

# Create a mapping dictionary for each column:

Urban_Rural_Values = {1:'Urban counties', 2:'Rural counties'}
Gender_Values = {1:'Male', 2:'Female'}
Age_Category_Values = {1:'18 to 24', 2:'25 to 34', 3:'35 to 44', 4:'45 to 54', 5:'55 to 64', 6:'65 or older'}
Education_Values = {1:'Did not graduate High School', 2:'Graduated High School', 3:'Attended College or Technical School', 4:'Graduated from College or Technical School', 9:np.nan}
Martial_Status_Values = {1:'Married', 2:'Divorced', 3:'Widowed', 4:'Separated', 5:'Never married', 6:'A member of an unmarried couple', 9:np.nan}
Income_Category_Values = {1:'0 to $15,000', 2:'$15,000 to < $25,000', 3:'$25,000 to < $35,000', 4:'$35,000 to < $50,000', 5:'$50,000 to < $100,000', 6:'$100,000 to < $200,000', 7:'$200,000 or more', 9:np.nan}
Veteran_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}
BMI_Category_Values = {1:'Underweight', 2:'Normal Weight', 3:'Overweight', 4:'Obese'}
Smoking_Values = {1:'Current smoker', 2:'Current smoker', 3:'Former smoker', 4:'Never smoked', 9:np.nan}
Alcohol_Drinks_Values = {88:0, 77:np.nan, 99:np.nan}
Insurance_Values = {1:'Yes', 2:'No', 9:np.nan}
HTN_Values = {1:'No', 2:'Yes', 9:np.nan}
MI_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}
CHD_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}
Stroke_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}
DM_Values = {1:'Yes', 2:'No', 3:'No', 4:'No', 7:np.nan, 9:np.nan}
Insulin_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}
Retinopathy_Counseled_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}
Asthma_Values =  {1:'No', 2:'Yes', 9:np.nan}
COPD_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}
Arthritis_Values = {1:'Yes', 2:'No'}
Kidney_Disease_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}
Depression_Values = {1:'Yes', 2:'No', 7:np.nan, 9:np.nan}


# Use the above dictionaries to replace the values in the columns.
# Assign the result back to the column: calling replace(..., inplace=True) on df['col']
# is chained assignment and may not modify df in newer pandas versions.

df['Urban_Rural'] = df['Urban_Rural'].replace(Urban_Rural_Values)
df['Gender'] = df['Gender'].replace(Gender_Values)
df['Age_Category'] = df['Age_Category'].replace(Age_Category_Values)
df['Education'] = df['Education'].replace(Education_Values)
df['Martial_Status'] = df['Martial_Status'].replace(Martial_Status_Values)
df['Income_Category'] = df['Income_Category'].replace(Income_Category_Values)
df['Veteran'] = df['Veteran'].replace(Veteran_Values)
df['BMI_Category'] = df['BMI_Category'].replace(BMI_Category_Values)
df['Smoking'] = df['Smoking'].replace(Smoking_Values)
df['Alcohol_Drinks'] = df['Alcohol_Drinks'].replace(Alcohol_Drinks_Values)
df['Insurance'] = df['Insurance'].replace(Insurance_Values)
df['HTN'] = df['HTN'].replace(HTN_Values)
df['MI'] = df['MI'].replace(MI_Values)
df['CHD'] = df['CHD'].replace(CHD_Values)
df['Stroke'] = df['Stroke'].replace(Stroke_Values)
df['DM'] = df['DM'].replace(DM_Values)
df['Insulin'] = df['Insulin'].replace(Insulin_Values)
df['Retinopathy_Counseled'] = df['Retinopathy_Counseled'].replace(Retinopathy_Counseled_Values)
df['Asthma'] = df['Asthma'].replace(Asthma_Values)
df['COPD'] = df['COPD'].replace(COPD_Values)
df['Arthritis'] = df['Arthritis'].replace(Arthritis_Values)
df['Kidney_Disease'] = df['Kidney_Disease'].replace(Kidney_Disease_Values)
df['Depression'] = df['Depression'].replace(Depression_Values)
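
Because `replace` leaves any code that is missing from a dictionary untouched, a quick check for leftover numeric codes can catch an incomplete mapping. The sketch below uses toy data; `VALUE_MAPS`, `unmapped_codes`, and the stray code `3` are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical container: one mapping dictionary per column.
VALUE_MAPS = {
    'Gender': {1: 'Male', 2: 'Female'},
    'Veteran': {1: 'Yes', 2: 'No', 7: np.nan, 9: np.nan},
}

# Toy data; the 3.0 in 'Veteran' is a made-up code that no dictionary covers.
toy = pd.DataFrame({
    'Gender': [1.0, 2.0, 2.0, 1.0],
    'Veteran': [2.0, 9.0, 1.0, 3.0],
})

for column, mapping in VALUE_MAPS.items():
    toy[column] = toy[column].replace(mapping)

def unmapped_codes(series):
    """Return the non-missing values that are still numeric codes, not labels."""
    return sorted(v for v in series.dropna().unique() if not isinstance(v, str))

for column in VALUE_MAPS:
    print(column, unmapped_codes(toy[column]))
```

An empty list for every column would suggest that all codes were either labelled or set to missing.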

In [6]:
# Change column type of Height, DM_Age, A1C, Feet_Check columns
df['Height'] = df['Height'].astype('float64')
df['DM_Age'] = df['DM_Age'].replace({98:np.nan, 99:np.nan}).astype('float64')
df['A1C'] = df['A1C'].replace({88:0, 98:np.nan, 99:np.nan, 77:np.nan}).astype('float64')
df['Feet_Check'] = df['Feet_Check'].replace({88:0, 99:np.nan, 77:np.nan}).astype('float64')

# Divide Weight and BMI by 100 (the codebook stores both with two implied decimal places)
df['Weight'] = df['Weight'] / 100
df['BMI'] = df['BMI'] / 100
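
As a sanity check on the scaling, the first row of the raw sample above has `HTM4 = 175`, `WTKG3 = 7257`, and `_BMI5 = 2363`. A short worked example, assuming height is in centimetres and weight in kilograms after the division:

```python
# Values copied from the first row of the raw sample shown earlier.
height_cm = 175.0
wtkg3 = 7257.0
bmi5 = 2363.0

weight_kg = wtkg3 / 100   # two implied decimals -> 72.57 kg
bmi = bmi5 / 100          # two implied decimals -> 23.63

# BMI recomputed from the scaled weight and the height in metres.
bmi_check = weight_kg / (height_cm / 100) ** 2

print(round(weight_kg, 2), round(bmi, 2), round(bmi_check, 2))
```

The recomputed value lands close to the stored one; I assume the small gap comes from the height being rounded to whole centimetres.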

In [7]:
# Convert the blood glucose measurement codes to the number of measurements per week

# Drop rows reporting more than 11 checks per day (codes 112-199)
# or more than 51 checks per week (codes 252-299)
df.drop(df[(df['BGM_Weekly'] > 111) & (df['BGM_Weekly'] < 200)].index, inplace=True)
df.drop(df[(df['BGM_Weekly'] > 251) & (df['BGM_Weekly'] < 300)].index, inplace=True)

# 1xx = times per day, 2xx = times per week, 3xx = times per month, 4xx = times per year.
# np.select converts the whole column at once and avoids chained assignment through .iloc.
bgm = df['BGM_Weekly']
df['BGM_Weekly'] = np.select(
    [bgm.between(101, 199), bgm.between(201, 299), bgm.between(300, 399), bgm.between(400, 499)],
    [(bgm - 100) * 7, bgm - 200, (bgm - 300) / 4, (bgm - 400) / 52],
    default=np.nan,  # every other code, and missing values, become NaN
)
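
To make the conversion concrete, here is the same arithmetic applied to a few single codes. This is only a worked example; `weekly_from_code` is a hypothetical helper that mirrors the branch boundaries of the cell above:

```python
import math

def weekly_from_code(code):
    """Convert one blood glucose frequency code to measurements per week."""
    if 101 <= code <= 199:      # times per day
        return (code - 100) * 7
    if 201 <= code <= 299:      # times per week
        return code - 200
    if 300 <= code <= 399:      # times per month, roughly 4 weeks each
        return (code - 300) / 4
    if 400 <= code <= 499:      # times per year, 52 weeks each
        return (code - 400) / 52
    return math.nan             # anything else is treated as missing

for code in (103, 205, 308, 452, 888):
    print(code, weekly_from_code(code))
```

So 3 checks per day become 21 per week, 8 per month become 2 per week, and 52 per year become 1 per week. Code `888` falls through to the last branch and ends up missing.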

In [8]:
# Remove duplicated rows

print(df.duplicated().sum())
df.drop_duplicates(inplace=True)

19265
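
One thing to keep in mind: the respondent identifier (`SEQNO`) is not among the kept columns, so rows counted as duplicates here are rows with identical answers in the 30 selected columns. Whether those are true duplicates or different respondents who happened to answer the same way is a judgment call. The toy example below, with made-up identifiers, shows the difference:

```python
import pandas as pd

# Toy data: three respondents, two of whom gave identical answers.
toy = pd.DataFrame({
    'SEQNO': ['2021000001', '2021000002', '2021000003'],  # made-up identifiers
    'Gender': ['Male', 'Male', 'Female'],
    'Age_Category': ['65 or older', '65 or older', '18 to 24'],
})

# With the identifier present, no row repeats.
dupes_with_id = int(toy.duplicated().sum())

# Without it, the second respondent looks like a copy of the first.
dupes_without_id = int(toy.drop(columns='SEQNO').duplicated().sum())

print(dupes_with_id, dupes_without_id)
```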


In [9]:
# Create a function to check for any outliers and replace them with upper/lower limit

def outliers(dataframe, column):
    """This function takes a dataframe and a column name as input.
    It checks whether the given column contains any outliers (values outside q1 - 1.5 * IQR and q3 + 1.5 * IQR).
    If there are outliers, it replaces them with the threshold (either the upper or the lower limit).
    If there are no outliers in the column, no changes are made.
    """
    q1 = dataframe[column].quantile(0.25)
    q3 = dataframe[column].quantile(0.75)
    iqr = q3 - q1
    lower_limit = q1 - (1.5 * iqr)
    upper_limit = q3 + (1.5 * iqr)
    # Boolean masks for the values below and above the limits
    is_low = dataframe[column] < lower_limit
    is_high = dataframe[column] > upper_limit
    if (is_low | is_high).any():
        dataframe.loc[is_low, column] = lower_limit
        dataframe.loc[is_high, column] = upper_limit
        print(f"{column} column had outliers; these outliers were replaced with upper/lower limits.")
    else:
        print(f"{column} column did not have any outliers; no changes were made.")

help(outliers)

Help on function outliers in module __main__:

outliers(dataframe, column)
    This function takes a dataframe and a column name as input.
    It checks whether the given column contains any outliers (values outside q1 - 1.5 * IQR and q3 + 1.5 * IQR).
    If there are outliers, it replaces them with the threshold (either the upper or the lower limit).
    If there are no outliers in the column, no changes are made.
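
For reference, the same capping rule can be written with `Series.clip`. This is a sketch on toy numbers, not the function used in this notebook; `cap_iqr` is a hypothetical name:

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Return a copy of `series` with values clipped to [q1 - k*IQR, q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data: q1 = 2.25, q3 = 4.75, IQR = 2.5, so the limits are -1.5 and 8.5.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
capped = cap_iqr(toy)
print(capped.tolist())
```

Only the value 100 lies outside the limits, so it is pulled down to 8.5 and the other five values stay as they are.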



In [10]:
# Choose which columns to apply the function to
num_cols = [col for col in df.columns if df[col].dtypes != 'O']

# Perform the function for all the selected columns
for col in num_cols:
    outliers(df, col)

# Have a look at the summary statistics of the dataframe after capping outliers
df.describe().T

Height column had outliers; these outliers were replaced with upper/lower limits.
Weight column had outliers; these outliers were replaced with upper/lower limits.
BMI column had outliers; these outliers were replaced with upper/lower limits.
Alcohol_Drinks column had outliers; these outliers were replaced with upper/lower limits.
DM_Age column had outliers; these outliers were replaced with upper/lower limits.
BGM_Weekly column had outliers; these outliers were replaced with upper/lower limits.
A1C column had outliers; these outliers were replaced with upper/lower limits.
Feet_Check column had outliers; these outliers were replaced with upper/lower limits.


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Height,401584.0,170.29314,10.631788,140.5,163.0,170.0,178.0,200.5
Weight,386935.0,83.009184,20.26222,27.225,68.04,81.65,95.25,136.065
BMI,380722.0,28.441652,5.966126,12.72,24.21,27.44,31.87,43.36
Alcohol_Drinks,199971.0,2.080892,1.391714,0.0,1.0,2.0,3.0,6.0
DM_Age,52592.0,49.699859,14.55059,10.0,40.0,50.0,60.0,90.0
BGM_Weekly,18486.0,9.78308,8.932008,0.019231,2.0,7.0,14.0,32.0
A1C,20557.0,2.521404,1.632547,0.0,1.0,2.0,4.0,8.5
Feet_Check,21030.0,1.852544,1.699803,0.0,1.0,1.0,3.0,6.0


In [11]:
# Have a final look at a randomly selected 10 rows of the dataframe before exporting it as a CSV file

df.sample(10)

Unnamed: 0,Urban_Rural,Gender,Age_Category,Education,Martial_Status,Income_Category,Veteran,Height,Weight,BMI,BMI_Category,Smoking,Alcohol_Drinks,Insurance,HTN,MI,CHD,Stroke,DM,DM_Age,BGM_Weekly,A1C,Insulin,Retinopathy_Counseled,Feet_Check,Asthma,COPD,Arthritis,Kidney_Disease,Depression
218739,Urban counties,Male,55 to 64,Did not graduate High School,Married,"$50,000 to < $100,000",No,175.0,95.25,31.01,Obese,Never smoked,,Yes,No,No,No,No,No,,,,,,,No,No,No,No,No
35412,Rural counties,Male,65 or older,Graduated High School,Divorced,,No,168.0,66.22,23.56,Normal Weight,Former smoker,,Yes,No,No,No,No,No,,,,,,,No,No,No,No,No
58361,Urban counties,Male,45 to 54,Graduated High School,Never married,"$15,000 to < $25,000",No,165.0,107.05,39.27,Obese,Never smoked,,,No,No,No,No,Yes,37.0,,,,,,Yes,No,Yes,No,No
271760,Urban counties,Male,45 to 54,Graduated High School,Married,"$50,000 to < $100,000",No,183.0,133.36,39.87,Obese,Never smoked,,Yes,Yes,No,No,No,Yes,49.0,,,,,,No,No,Yes,No,No
40424,Urban counties,Male,25 to 34,Graduated High School,,"$15,000 to < $25,000",No,188.0,86.18,24.39,Normal Weight,Former smoker,2.0,,No,No,No,No,No,,,,,,,No,No,No,No,No
107896,Urban counties,Female,35 to 44,Graduated from College or Technical School,Never married,"$50,000 to < $100,000",No,163.0,108.86,41.2,Obese,Never smoked,2.0,Yes,Yes,No,No,No,Yes,39.0,,,,,,No,No,Yes,No,No
128412,Urban counties,Male,45 to 54,Attended College or Technical School,Separated,"$100,000 to < $200,000",No,183.0,74.84,22.38,Normal Weight,Former smoker,,Yes,Yes,No,No,No,No,,,,,,,No,No,No,No,No
347060,Rural counties,Male,55 to 64,Graduated High School,Widowed,"0 to $15,000",No,168.0,99.79,35.51,Obese,,,Yes,Yes,No,No,No,Yes,,,,,,,No,No,Yes,No,Yes
11404,Urban counties,Female,65 or older,Attended College or Technical School,Divorced,"$15,000 to < $25,000",No,152.0,53.52,23.05,Normal Weight,Never smoked,2.0,Yes,Yes,No,No,,No,,,,,,,No,No,Yes,No,No
22822,Urban counties,Female,65 or older,Graduated from College or Technical School,Married,"$50,000 to < $100,000",No,155.0,74.84,31.18,Obese,Never smoked,,Yes,Yes,No,No,No,Yes,38.0,7.0,3.0,Yes,No,3.0,No,No,No,No,No


In [12]:
# Exporting the results to a csv file so it can be used in Parts 2 and 3

df.to_csv('LLCP2021_Prepared.csv', index=False)
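
Before moving on to Part 2, it can be worth confirming that the file reads back as expected, in particular that missing values and labels containing commas survive the round trip. A minimal sketch on a toy frame, using an in-memory buffer in place of the real file:

```python
import io

import numpy as np
import pandas as pd

# Toy frame with a comma inside a label and missing values in two columns.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female'],
    'BMI': [31.01, np.nan],
    'Income_Category': ['$50,000 to < $100,000', np.nan],
})

# Write to an in-memory buffer, then read it back the way Part 2 would read the file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)
print(restored.isna().sum())
```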