#**CS 133 Term Project**



**Project Deliverables**

1. Each project team will choose one of the following data sets for their project. Create a Google Colab notebook to perform the data analysis and visualization for this project. Create and answer at least 5 unique questions using different types of plots to help you understand the data. ~~One of the plots must be a map visualization. Besides the map, one of the plots must be an interactive plot.~~ You can create additional categorical columns or reshape your data to help you understand the data.

2. Each data set has a prediction goal. Create a test set and a training set using the original data set.

3. Follow the steps that we use in Lectures to prepare the data and pipeline for training a few ML classification or regression models that can perform the prediction as indicated in the data set description. Use any strategy that you see fit. Use N-fold cross-validation to evaluate the performance of each model.

4. Use the appropriate metrics to evaluate the models' performance. Select the best one for fine-tuning.

5. Test your best and fine-tuned ML model using the test set.

6. Create a Google Slides presentation (~15 min long) in which your group will explain the following:

   - How the data visualization helped you choose certain strategies in developing the ML training pipeline
   - What strategy is used to create test/train data
   - What ML models are chosen, and why are they suitable for this analysis
   - The performance of all trained models (including the performance metrics)
   - Show the prediction performance of the best ML model using the test set.
   - To wrap up, discuss the challenges you have encountered and/or any other thoughts you have about this project.

7. Submit the URLs for the Colab notebook and the Google Slides presentation in Canvas.
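
Steps 2 to 5 above can be sketched roughly as follows. This is only a minimal illustration on a small synthetic data frame, not a required solution: the column names (`size`, `kind`, `target`), the two candidate models, and the parameter grids are all made up for the example, and a regression target is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a project data set (hypothetical columns and values).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "size": rng.normal(100, 20, n),
    "kind": rng.choice(["a", "b", "c"], n),
})
toy["target"] = 3 * toy["size"] + 50 * (toy["kind"] == "a") + rng.normal(0, 5, n)
X, y = toy.drop(columns="target"), toy["target"]

# Step 2: hold out a test set before any modeling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: preprocessing (impute + scale numbers, one-hot encode categories).
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["kind"]),
])

def build_pipeline(model):
    """Wrap a model with the shared preprocessing steps."""
    return Pipeline([("prep", preprocess), ("model", model)])

# Candidate models with a small (illustrative) grid for later fine-tuning.
candidates = {
    "ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "forest": (RandomForestRegressor(n_estimators=50, random_state=42),
               {"model__max_depth": [3, None]}),
}

# Step 3 (continued): 5-fold cross-validation on the training set only.
cv_rmse = {}
for name, (model, _) in candidates.items():
    scores = cross_val_score(build_pipeline(model), X_train, y_train,
                             cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

# Step 4: pick the model with the lowest cross-validated RMSE and fine-tune it.
best_name = min(cv_rmse, key=cv_rmse.get)
best_model, grid = candidates[best_name]
search = GridSearchCV(build_pipeline(best_model), grid,
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

# Step 5: evaluate the tuned model once on the held-out test set.
test_rmse = mean_squared_error(y_test, search.predict(X_test)) ** 0.5
print(cv_rmse, best_name, round(test_rmse, 2))
```

For the data sets with a categorical target, the same structure should work with classifiers and a classification metric (for example `scoring="f1"`) in place of the regressors and RMSE.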



---

#**Data Sets:**

#**US Health Insurance**

<img src="https://www.usnews.com/object/image/0000017c-e0b1-dfe4-a77d-fef3754d0000/211102hckhnplans-stock.jpg?update-time=1635858623610&size=responsive970" width=600>   
<font color=gray size=2> Image source: U.S. News</font>


SAHIE is the Small Area Health Insurance Estimates program of the U.S. Census Bureau. It publishes state and county estimates of the population with and without health insurance coverage.

Here are the descriptions of this dataset:
- `year` - Year of Estimate
- `statefips` - Unique FIPS code for each state
- `countyfips` - Unique FIPS code for each county within a state
- `geocat` - Geography category, 40 – State geographic identifier, 50 – County geographic identifier
- `agecat` - Age category
- `racecat` - Race category
- `sexcat` - Sex category
- `iprcat` - Income category
- `NIPR` - Number in demographic group for income category
- `NUI` - Number uninsured
- `NIC` - Number insured
- `PCTUI` - Percent uninsured in demographic group for income category
- `PCTIC` - Percent insured in demographic group for income category
- `PCTELIG` - Percent uninsured in demographic group for all income levels
- `PCTLIIC` - Percent insured in demographic group for all income levels
- `state_name` - State name

Data Source Credit: Small Area Health Insurance Estimates Program, U.S. Census Bureau.
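
The count and percentage columns described above are arithmetically related, which is worth keeping in mind when selecting features. The sketch below rebuilds `PCTUI` and `PCTIC` from the counts, using values copied from the first two rows of the table displayed in this section; the helper column names `pctui_check` and `pctic_check` are made up for the illustration.

```python
import pandas as pd

# Values copied from rows 0 and 1 of the displayed table.
rows = pd.DataFrame({
    "NIPR": [204748, 274778],
    "NUI": [58255, 72162],
    "NIC": [146493, 202616],
    "PCTUI": [28.5, 26.3],
    "PCTIC": [71.5, 73.7],
})

# Uninsured plus insured should equal the size of the demographic group.
counts_add_up = (rows["NUI"] + rows["NIC"] == rows["NIPR"]).all()

# Percentages recomputed from the counts, rounded to one decimal place.
rows["pctui_check"] = (100 * rows["NUI"] / rows["NIPR"]).round(1)
rows["pctic_check"] = (100 * rows["NIC"] / rows["NIPR"]).round(1)

print(counts_add_up)
print(rows[["PCTUI", "pctui_check", "PCTIC", "pctic_check"]])
```

Whether the same relationships hold for every row is an assumption to confirm on the full data set.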

###**Prediction**
You are tasked to build an ML model to predict the percent uninsured in demographic group for all income levels (`PCTELIG`).


In [None]:
import pandas as pd
data = 'https://raw.githubusercontent.com/csbfx/cs133/main/us_health_insurance_2020.csv'
df = pd.read_csv(data)
df

Unnamed: 0,year,statefips,countyfips,geocat,agecat,racecat,sexcat,iprcat,NIPR,NUI,NIC,PCTUI,PCTIC,PCTELIG,PCTLIIC,state_name
0,2020,1,0,40,18 to 64 years,"White alone, not Hispanic",Male,At or below 200% of poverty,204748,58255,146493,28.5,71.5,6.3,15.8,Alabama ...
1,2020,1,0,40,18 to 64 years,"White alone, not Hispanic",Male,At or below 250% of poverty,274778,72162,202616,26.3,73.7,7.8,21.9,Alabama ...
2,2020,1,0,40,18 to 64 years,"White alone, not Hispanic",Male,At or below 138% of poverty,123241,38484,84757,31.2,68.8,4.2,9.2,Alabama ...
3,2020,1,0,40,18 to 64 years,"White alone, not Hispanic",Male,At or below 400% of poverty,488459,100802,387657,20.6,79.4,10.9,41.9,Alabama ...
4,2020,1,0,40,18 to 64 years,"White alone, not Hispanic",Male,Between 138% - 400% of poverty,365218,62318,302900,17.1,82.9,6.7,32.7,Alabama ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6115,2020,56,0,40,21 to 64 years,Hispanic (any race),Female,At or below 200% of poverty,5399,2414,2985,44.7,55.3,16.4,20.3,Wyoming ...
6116,2020,56,0,40,21 to 64 years,Hispanic (any race),Female,At or below 250% of poverty,6950,2980,3970,42.9,57.1,20.2,27.0,Wyoming ...
6117,2020,56,0,40,21 to 64 years,Hispanic (any race),Female,At or below 138% of poverty,3488,1615,1873,46.3,53.7,11.0,12.7,Wyoming ...
6118,2020,56,0,40,21 to 64 years,Hispanic (any race),Female,At or below 400% of poverty,10249,3907,6342,38.1,61.9,26.5,43.1,Wyoming ...


---

##**2015 US Police Killings**

<img src="https://www.ft.com/__origami/service/image/v2/images/raw/https://d1e00ek4ebabms.cloudfront.net/production/8710fc45-0f60-4080-860b-00f866a6fc81.jpg?source=next&fit=scale-down&quality=highest&width=1440&dpr=1" width=600>  
<font size=2 color=gray>Image credit: Financial Times</font>

This dataset is behind the story [Where Police Have Killed Americans In 2015](https://fivethirtyeight.com/features/where-police-have-killed-americans-in-2015/). It also contains census data from the American Community Survey. Census data was calculated at the tract level from the 2015 5-year American Community Survey using the tables S0601 (demographics), S1901 (tract-level income and poverty), S1701 (employment and education) and DP03 (county-level income). Census tracts were determined by geocoding addresses to latitude/longitude using the Bing Maps and Google Maps APIs and then overlaying points onto 2014 census tracts.

Here are the descriptions of the data:

- `age` - Age of deceased
- `gender` - Gender of deceased
- `raceethnicity` - Race/ethnicity of deceased
- `city` - City where incident occurred
- `state` - State where incident occurred
- `latitude` - Latitude, geocoded from address
- `longitude` - Longitude, geocoded from address
- `lawenforcementagency` - Agency involved in incident
- `cause` - Cause of death
- `armed` - How/whether deceased was armed
- `region` - Region of the US where the incident occurred
- `agegroup` - Age-group that the deceased belongs to
- `share_white`- Share of pop that is non-Hispanic white
- `share_black` - Share of pop that is black (alone, not in combination)
- `share_hispanic` - Share of pop that is Hispanic/Latino (any race)
- `p_income` - Tract-level median personal income
- `h_income` - Tract-level median household income
- `county_income` - County-level median household income
- `pov` - Tract-level poverty rate
- `urate` - Tract-level unemployment rate
- `college` - Share of 25+ pop with BA or higher

Data Source Credit: [FiveThirtyEight](https://fivethirtyeight.com/)

###**Prediction**

You are tasked to build an ML model to predict whether the deceased was armed or not.

In [None]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

data='https://raw.githubusercontent.com/csbfx/cs133/main/police_killing.csv'
df = pd.read_csv(data)
df

Unnamed: 0,age,gender,raceethnicity,month,year,city,state,latitude,longitude,lawenforcementagency,...,share_white,share_black,share_hispanic,p_income,h_income,county_income,comp_income,pov,urate,college
0,16,Male,Black,February,2015,Millbrook,AL,32.529577,-86.362829,Millbrook Police Department,...,60.5,30.5,5.6,28375,51367.0,54766,0.937936,14.1,0.097686,0.168510
1,27,Male,White,April,2015,Pineville,LA,31.321739,-92.434860,Rapides Parish Sheriff's Office,...,53.8,36.2,0.5,14678,27972.0,40930,0.683411,28.8,0.065724,0.111402
2,26,Male,White,March,2015,Kenosha,WI,42.583560,-87.835710,Kenosha Police Department,...,73.8,7.7,16.8,25286,45365.0,54930,0.825869,14.6,0.166293,0.147312
3,25,Male,Hispanic/Latino,March,2015,South Gate,CA,33.939298,-118.219463,South Gate Police Department,...,1.2,0.6,98.8,17194,48295.0,55909,0.863814,11.7,0.124827,0.050133
4,29,Male,White,March,2015,Munroe Falls,OH,41.148575,-81.429878,Kent Police Department,...,92.5,1.4,1.7,33954,68785.0,49669,1.384868,1.9,0.063550,0.403954
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
462,18,Male,Black,April,2015,Portsmouth,VA,36.829014,-76.341438,Portsmouth Police Department,...,40.9,53.8,0,25262,27418.0,46166,0.593900,35.2,0.152047,0.120553
463,28,Male,Native American,April,2015,Tonasket,WA,48.708542,-119.436829,US Forest Service,...,74.5,0.4,20.2,18470,35608.0,40368,0.882085,27.3,0.133650,0.174525
464,52,Male,White,March,2015,Gaston,NC,35.205776,-81.240669,Gaston County Police Department,...,83.2,10.1,0.3,21175,38200.0,42017,0.909156,28.5,0.256150,0.072764
465,38,Female,Black,February,2015,Oakland,CA,37.827129,-122.284492,Emeryville Police Department,...,21.7,24.9,37.1,26971,63052.0,72112,0.874362,23.9,0.069601,0.396476


---

#**World Happiness 2005 - 2021**

<img src="https://ggsc.s3.amazonaws.com/images/uploads/Stick_figure_map_of_world.jpg" width=600>  
<font color=gray size=2>Image Credit: UC Berkeley</font>

The World Happiness Report is a publication of the Sustainable Development Solutions Network, powered by the Gallup World Poll data. Life evaluations from the Gallup World Poll provide the basis for the annual happiness rankings that have always sparked widespread interest. We can learn what factors contribute to the secrets of life in the happiest countries.

Here are the descriptions of the data:

- `Country_name` - Name of the country
- `year` - This dataset contains survey data of the state of global happiness from 2005 to 2021.
- `Life_Ladder` - Also known as the Cantril ladder. Respondents are asked to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale.
- `Log_GDP_per_capita` - is in terms of Purchasing Power Parity (PPP) adjusted to constant 2011 international dollars, taken from the World Development Indicators (WDI) released by the World Bank on November 14, 2018. The equation uses the natural log of GDP per capita, as this form fits the data significantly better than GDP per capita.
- `Social_support` - is the national average of the binary responses (either 0 or 1) to the Gallup World Poll (GWP) question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
- `Healthy_life_expectancy_at_birth` - are constructed based on data from the World Health Organization (WHO) Global Health Observatory data repository, with data available for 2005, 2010, 2015, and 2016. To match this report’s sample period, interpolation and extrapolation are used.
- `Freedom_to_make_life_choices` - is the national average of binary responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
- `Generosity` - is the residual of regressing the national average of GWP responses to the question “Have you donated money to a charity in the past month?” on GDP per capita.
- `Perceptions_of_corruption` - are the average of binary answers to two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” Where data for government corruption are missing, the perception of business corruption is used as the overall corruption-perception measure.
- `Positive_affect` - is defined as the average of previous-day affect measures for happiness, laughter, and enjoyment for GWP waves 3-7 (years 2008 to 2012, and some in 2013). It is defined as the average of laughter and enjoyment for other waves where the happiness question was not asked. The general form for the affect questions is: Did you experience the following feelings during a lot of the day yesterday?
- `Negative_affect` - is defined as the average of previous-day affect measures for worry, sadness, and anger for all waves.
- `Confidence_in_national_government` - level of trust in government


Data source credit: [World Happiness Report](https://worldhappiness.report/)
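
Because `Log_GDP_per_capita` is a natural log, exponentiating it recovers GDP per capita in PPP dollars, which can be easier to read in plots. A minimal sketch, using the two Afghanistan values (2008 and 2009) copied from the table displayed in this section:

```python
import numpy as np
import pandas as pd

# Values copied from the first two Afghanistan rows of the displayed table.
log_gdp = pd.Series([7.302574, 7.472446], index=[2008, 2009],
                    name="Log_GDP_per_capita")

# Invert the natural log to get GDP per capita on the original scale.
gdp_per_capita = np.exp(log_gdp)
print(gdp_per_capita.round(0))
```

On the log scale, a difference of 1 corresponds to a factor of about 2.72 in GDP per capita.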


###**Prediction**

You are tasked to build an ML model to predict the Life Ladder - the measure of happiness.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
data = 'https://raw.githubusercontent.com/csbfx/cs133/main/world_happiness_2005-2021.csv'
df = pd.read_csv(data)
df

Unnamed: 0,Country_name,year,Life_Ladder,Log_GDP_per_capita,Social_support,Healthy_life_expectancy_at_birth,Freedom_to_make_life_choices,Generosity,Perceptions_of_corruption,Positive_affect,Negative_affect,Confidence_in_national_government
0,Afghanistan,2008,3.723590,7.302574,0.450662,50.500000,0.718114,0.173169,0.881686,0.414297,0.258195,0.612072
1,Afghanistan,2009,4.401778,7.472446,0.552308,50.799999,0.678896,0.195469,0.850035,0.481421,0.237092,0.611545
2,Afghanistan,2010,4.758381,7.579183,0.539075,51.099998,0.600127,0.125859,0.706766,0.516907,0.275324,0.299357
3,Afghanistan,2011,3.831719,7.552006,0.521104,51.400002,0.495901,0.167723,0.731109,0.479835,0.267175,0.307386
4,Afghanistan,2012,3.782938,7.637953,0.520637,51.700001,0.530935,0.241247,0.775620,0.613513,0.267919,0.435440
...,...,...,...,...,...,...,...,...,...,...,...,...
2084,Zimbabwe,2017,3.638300,8.241609,0.754147,52.150002,0.752826,-0.113937,0.751208,0.733641,0.224051,0.682647
2085,Zimbabwe,2018,3.616480,8.274620,0.775388,52.625000,0.762675,-0.084747,0.844209,0.657524,0.211726,0.550508
2086,Zimbabwe,2019,2.693523,8.196998,0.759162,53.099998,0.631908,-0.081540,0.830652,0.658434,0.235354,0.456455
2087,Zimbabwe,2020,3.159802,8.117733,0.717243,53.575001,0.643303,-0.029376,0.788523,0.660658,0.345736,0.577302


---

#**SF Bay Area House Prices**

<img src="https://www.worldpropertyjournal.com/news-assets/San-Francisco-homes-california-keyimage.jpg" width=600>  
<font size=2 color=gray>Image Credit: World Property Journal</font>  


Bay Area real estate prices have been rapidly appreciating since 2012. Over the last decade, median property values have more than doubled, and some areas of the Bay Area have more than tripled!

The rapidly escalating prices are great for those who have already bought into the market. For potential homebuyers, however, they mean getting less home for the dollar (and/or having to increase the budget), and they give buyers pause about purchasing at the peak of the market.

Here is a data set of over 7,000 active listings from June 2019 containing factors influencing home prices across the region, including number of bedrooms and bathrooms, home size, lot size, school quality, and commute times.

Here are the descriptions of the data:

- `Address` - the address of the house
- `City` - the city where the house is located
- `State` - California, this data set is from the Bay Area
- `Zip` - postal zip code
- `Price` - listing price of the house
- `Beds` - number of bedrooms
- `Baths` - number of bathrooms
- `Home size` - the square footage of the house
- `Lot size` - the square footage of the lot
- `Latitude` - latitude coordinate
- `Longitude` - longitude coordinate
- `SF time` - the commute time by car at 8 AM to San Francisco
- `PA time` - the commute time by car at 8 AM to Palo Alto
- `School score` - the quality of the schools in the neighborhood
- `Commute time` - the commute time by car at 8 AM to the general Bay Area.

Data Source Credit: Michael Boles

###**Prediction**

You are tasked to build an ML model to predict the price of the homes in the SF Bay Area.

In [None]:
import pandas as pd
data='https://raw.githubusercontent.com/csbfx/cs133/main/sf_bayarea_house_prices.csv'
df = pd.read_csv(data)
df

Unnamed: 0,Address,City,State,Zip,Price,Beds,Baths,Home size,Lot size,Latitude,Longitude,SF time,PA time,School score,Commute time
0,2412 Palmer Ave,Belmont,CA,94002,1459000,3,2.0,1360.0,5001.0,37.516781,-122.304623,63,33,77.9,33
1,1909 Hillman Ave,Belmont,CA,94002,1595000,4,2.0,2220.0,3999.0,37.521972,-122.294079,63,33,77.9,33
2,641 Waltermire St,Belmont,CA,94002,899999,2,1.0,840.0,4234.0,37.520233,-122.273144,63,33,77.9,33
3,2706 Sequoia Way,Belmont,CA,94002,1588000,3,2.0,1860.0,5210.0,37.520192,-122.309437,63,33,77.9,33
4,1568 Winding Way,Belmont,CA,94002,1999000,4,3.5,2900.0,16117.2,37.524280,-122.291241,63,33,77.9,33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7140,The Davis,Mountain House,CA,95391,603990,5,3.0,2327.0,,37.756444,-121.547719,120,125,65.3,120
7141,The Berkeley,Mountain House,CA,95391,619990,5,4.0,2410.0,,37.756444,-121.547719,120,125,65.3,120
7142,Geranium,Mountain House,CA,95391,666340,5,4.0,2486.0,,37.764721,-121.537761,120,125,65.3,120
7143,The Pepperdine,Mountain House,CA,95391,659990,5,4.0,2856.0,,37.756444,-121.547719,120,125,65.3,120


---

##**COVID-19 US Statistics 2021**

<img src="https://s3-prod.modernhealthcare.com/s3fs-public/covid-us-map-icons_i.png" width=600>  
<font color=gray size=2>Image Credit: Modern Healthcare</font>  

This dataset contains United States COVID-19 statistics from December 2020 to December 2021 collected from the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering ([JHU CSSE](https://github.com/CSSEGISandData/COVID-19)). The repository is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

Here are the data descriptions:
- `Province_State` - The name of the State within the USA.
- `Country_Region` - The name of the Country (US).
- `Lat` - Latitude.
- `Long_` - Longitude.
- `Confirmed` - Aggregated case count for the state.
- `Deaths` - Aggregated death toll for the state.
- `Recovered` - Aggregated Recovered case count for the state.
- `Active` - Aggregated confirmed cases that have not been resolved (active cases = total cases - total recovered - total deaths).
- `FIPS` - Federal Information Processing Standards code that uniquely identifies counties within the USA.
- `Incident_Rate` - cases per 100,000 persons.
- `Total_Test_Results` - Total number of people who have been tested.
- `People_Hospitalized` - Total number of people hospitalized. (Nullified on Aug 31, see Issue #3083)
- `Case_Fatality_Ratio` - Number of recorded deaths * 100 / number of confirmed cases.
- `UID` - Unique Identifier for each row entry.
- `Testing_Rate` - Total test results per 100,000 persons. The "total test results" are equal to "Total test results (Positive + Negative)" from COVID Tracking Project.
- `Date` - The day the data are recorded.

Data Source Credit: Johns Hopkins University Center for Systems Science and Engineering
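
The definitions of `Active` and `Case_Fatality_Ratio` above are simple formulas over other columns. The sketch below checks both against the Alabama and Arizona rows of the table displayed in this section; the `_check` column names are made up for the illustration.

```python
import pandas as pd

# Values copied from the Alabama and Arizona rows (2020-12-31) of the displayed table.
rows = pd.DataFrame({
    "Province_State": ["Alabama", "Arizona"],
    "Confirmed": [361226, 520207],
    "Deaths": [4827, 8864],
    "Recovered": [202137.0, 75981.0],
    "Active": [154262.0, 435362.0],
    "Case_Fatality_Ratio": [1.336283, 1.703937],
})

# Active cases = total cases - total recovered - total deaths.
rows["active_check"] = rows["Confirmed"] - rows["Recovered"] - rows["Deaths"]

# Case fatality ratio = recorded deaths * 100 / confirmed cases.
rows["cfr_check"] = rows["Deaths"] * 100 / rows["Confirmed"]

print(rows[["Province_State", "Active", "active_check",
            "Case_Fatality_Ratio", "cfr_check"]])
```

Since the prediction target can be reconstructed directly from `Deaths` and `Confirmed`, a model given both columns has little left to learn; it may be worth deciding deliberately which of these columns to keep as features.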

###**Prediction**
You are tasked to build an ML model to predict the Case Fatality Ratio.

In [None]:
import pandas as pd
data = 'https://raw.githubusercontent.com/csbfx/cs133/main/covid19_2020-2021.csv'
df = pd.read_csv(data)
df

Unnamed: 0,Province_State,Lat,Long_,Confirmed,Deaths,Recovered,Active,FIPS,Incident_Rate,Total_Test_Results,Case_Fatality_Ratio,UID,Testing_Rate,Date
0,Alabama,32.3182,-86.9023,361226,4827,202137.0,154262.0,1.0,7367.170523,,1.336283,84000001.0,,2020-12-31
1,Alaska,61.3707,-152.4044,47014,206,7165.0,39643.0,2.0,6426.672317,1275750.0,0.438167,84000002.0,174391.185778,2020-12-31
2,American Samoa,-14.2710,-170.1320,0,0,,,60.0,0.000000,2140.0,,16.0,3846.084722,2020-12-31
3,Arizona,33.7298,-111.4312,520207,8864,75981.0,435362.0,4.0,7146.960103,5059770.0,1.703937,84000004.0,38945.764755,2020-12-31
4,Arkansas,34.9697,-92.3731,225138,3676,199247.0,22215.0,5.0,7460.325455,2051488.0,1.632776,84000005.0,67979.497674,2020-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21223,Virginia,37.7693,-78.1700,1118518,15587,,,51.0,13104.276377,11179014.0,1.393540,84000051.0,130970.524464,2021-12-31
21224,Washington,47.4009,-121.4905,849075,9853,,,53.0,11150.189504,9958651.0,1.160439,84000053.0,130778.607132,2021-12-31
21225,West Virginia,38.4912,-80.9545,328162,5336,,,54.0,18311.109524,4938024.0,1.626026,84000054.0,275536.772374,2021-12-31
21226,Wisconsin,44.2685,-89.6165,1120663,11173,,,55.0,19247.328523,13355370.0,0.996999,84000055.0,0.000023,2021-12-31


---

##**Autism in children**

<img src="https://vcresearch.berkeley.edu/sites/default/files/inline-images/autisticbrain750.jpg" width=600>  
<font size=2 color=gray>Image Credit: UC Berkeley</font>

Autism Spectrum Disorder (ASD) is a neurodevelopmental condition associated with significant healthcare costs, and early diagnosis can significantly reduce these costs. Unfortunately, waiting times for an ASD diagnosis are lengthy and procedures are not cost-effective. The economic impact of autism and the increase in the number of ASD cases across the world reveal an urgent need for easily implemented and effective screening methods. A time-efficient and accessible ASD screening tool is therefore urgently needed to help health professionals and to inform individuals whether they should pursue a formal clinical diagnosis. The rapid growth in the number of ASD cases worldwide necessitates datasets related to behaviour traits. However, such datasets are rare, making it difficult to perform thorough analyses to improve the efficiency, sensitivity, specificity and predictive accuracy of the ASD screening process. Presently, very few autism datasets associated with clinical diagnosis or screening are available, and most of them are genetic in nature. This dataset contains ten behavioural features ([AQ-10-Child](https://autismhampshire.org.uk/assets/uploads/AQ10-Child.pdf)) plus ten individual characteristics that have proved effective in distinguishing ASD cases from controls in behaviour science.

Here are the descriptions of the dataset:

- `A1_Score` - AQ-10-Child Question 1 Answer Binary (0, 1)
- `A2_Score` - AQ-10-Child Question 2 Answer Binary (0, 1)
- `A3_Score` - AQ-10-Child Question 3 Answer Binary (0, 1)
- `A4_Score` - AQ-10-Child Question 4 Answer Binary (0, 1)
- `A5_Score` - AQ-10-Child Question 5 Answer Binary (0, 1)
- `A6_Score` - AQ-10-Child Question 6 Answer Binary (0, 1)
- `A7_Score` - AQ-10-Child Question 7 Answer Binary (0, 1)
- `A8_Score` - AQ-10-Child Question 8 Answer Binary (0, 1)
- `A9_Score` - AQ-10-Child Question 9 Answer Binary (0, 1)
- `A10_Score` - AQ-10-Child Question 10 Answer Binary (0, 1)
- `age` - Age (number)
- `gender` - Gender: male (`m`) or female (`f`)
- `ethnicity` - Ethnicity
- `jundice` - Whether the child was born with jaundice (yes or no)
- `austim` - Whether any immediate family member has a pervasive developmental disorder (yes or no)
- `country_of_res` - Country of residence
- `used_app_before` - Whether the user has used a screening app (yes or no)
- `total_score` - Sum of scores from the 10 questions. If the individual scores 6 or above, they should seek a specialist diagnostic assessment.
- `age_desc` - Description of the age group (e.g., `4-11 years`)
- `relation` - Who is completing the test (parent, self, caregiver, medical staff, clinician, etc.)
- `ASD` - Case of autism (YES or NO)

Data Source Credit: Department of Digital Technology, Manukau Institute of Technology, Auckland, New Zealand
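
The `total_score` definition above can be reproduced directly from the ten answer columns. The sketch below does so for five rows copied from the table displayed in this section; the `refer` column is a made-up name for the "6 or above" rule.

```python
import pandas as pd

score_cols = [f"A{i}_Score" for i in range(1, 11)]

# A1..A10 answers copied from rows 0, 3, 4, 288 and 289 of the displayed table.
sample = pd.DataFrame(
    [
        [1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
        [0, 1, 0, 0, 1, 1, 0, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 0, 0, 0, 1, 0, 1, 0, 0, 1],
        [1, 0, 1, 1, 1, 1, 1, 0, 0, 1],
    ],
    columns=score_cols,
    index=[0, 3, 4, 288, 289],
)

# total_score is the sum of the ten binary answers.
sample["total_score"] = sample[score_cols].sum(axis=1)

# Screening rule from the column description: 6 or above suggests referral.
sample["refer"] = sample["total_score"] >= 6

print(sample[["total_score", "refer"]])
```

In these five rows the result of the rule agrees with the `ASD` label shown in the table. It may be worth checking on the full data set whether `total_score` (or the ten answers together) effectively gives away the target before using them as features.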

###**Prediction**
You are tasked to build an ML model to predict whether a patient has autism spectrum disorder (ASD) or not.


In [None]:
import pandas as pd
data="https://raw.githubusercontent.com/csbfx/cs133/main/autism_child.csv"
df = pd.read_csv(data, sep=',')
df

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,...,gender,ethnicity,jundice,austim,country_of_res,used_app_before,total_score,age_desc,relation,ASD
0,1,1,0,0,1,1,0,1,0,0,...,m,Others,no,no,Jordan,no,5,4-11 years,Parent,NO
1,1,1,0,0,1,1,0,1,0,0,...,m,Middle Eastern,no,no,Jordan,no,5,4-11 years,Parent,NO
2,1,1,0,0,0,1,1,1,0,0,...,m,?,no,no,Jordan,yes,5,4-11 years,?,NO
3,0,1,0,0,1,1,0,0,0,1,...,f,?,yes,no,Jordan,no,4,4-11 years,?,NO
4,1,1,1,1,1,1,1,1,1,1,...,m,Others,yes,no,United States,no,10,4-11 years,Parent,YES
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,1,1,1,1,1,1,1,1,1,1,...,f,White-European,yes,yes,United Kingdom,no,10,4-11 years,Parent,YES
288,1,0,0,0,1,0,1,0,0,1,...,f,White-European,yes,yes,Australia,no,4,4-11 years,Parent,NO
289,1,0,1,1,1,1,1,0,0,1,...,m,Latino,no,no,Brazil,no,7,4-11 years,Parent,YES
290,1,1,1,0,1,1,1,1,1,1,...,m,South Asian,no,no,India,no,9,4-11 years,Parent,YES


---

#**Global Air Pollution**

<img src="https://cpree.princeton.edu/sites/g/files/toruqf651/files/styles/16x9_750w_422h/public/pages/air-pollution-stsk_150001988.jpg?itok=JYyXokyU" width=600>  
<font size=2 color=gray>Image Credit: Center for Policy Research on Energy and the Environment, Princeton University</font>

Air pollution is contamination that modifies the natural characteristics of the atmosphere. It often comes from household combustion devices, motor vehicles, industrial facilities and forest fires. Pollutants implicated in health concerns include particulate matter, carbon monoxide, ozone, nitrogen dioxide and sulfur dioxide.

This dataset provides geolocated information about the following pollutants:

Nitrogen Dioxide [NO2] : Nitrogen dioxide is one of the pollutants emitted by cars, trucks and buses, power plants and off-road equipment. Exposure over short periods can aggravate respiratory diseases such as asthma. Longer exposures may contribute to the development of asthma and respiratory infections. People with asthma, children and the elderly are at greater risk for the health effects of NO2.

Ozone [O3] : Ground-level ozone is harmful to outdoor air quality. It is created by chemical reactions between oxides of nitrogen and volatile organic compounds (VOC). Unlike the beneficial ozone located in the upper atmosphere, ground-level ozone can provoke several health problems such as chest pain, coughing, throat irritation and airway inflammation. Furthermore, it can reduce lung function and worsen bronchitis, emphysema and asthma. Ozone also affects vegetation and ecosystems; in particular, it damages sensitive vegetation during the growing season.

Carbon Monoxide [CO] : Carbon monoxide is a colorless and odorless gas. Outdoors, it is emitted mainly by cars, trucks and other vehicles or machinery that burn fossil fuels. Indoors, items such as kerosene and gas space heaters and gas stoves also release CO, affecting indoor air quality. Breathing air with a high concentration of CO reduces the amount of oxygen that can be transported in the bloodstream to critical organs like the heart and brain. At very high levels, which are unlikely to occur outdoors but are possible in enclosed environments, CO can cause dizziness, confusion, unconsciousness and death.

Particulate Matter [PM2.5] : Atmospheric particulate matter, also known as atmospheric aerosol particles, is a complex mixture of small solid and liquid particles suspended in the air. If inhaled, these particles can cause serious heart and lung problems. They have been classified as a group 1 carcinogen by the International Agency for Research on Cancer (IARC). PM10 refers to particles with a diameter of 10 micrometers or less. PM2.5 refers to particles with a diameter of 2.5 micrometers or less.

Here are the descriptions of this dataset:
- `Country` : Name of the country
- `City` : Name of the city
- `AQI Value` : Overall AQI value of the city
- `AQI Category` : Overall AQI category of the city
- `CO AQI Value` : AQI value of Carbon Monoxide of the city
- `CO AQI Category` : AQI category of Carbon Monoxide of the city
- `Ozone AQI Value` : AQI value of Ozone of the city
- `Ozone AQI Category` : AQI category of Ozone of the city
- `NO2 AQI Value` : AQI value of Nitrogen Dioxide of the city
- `NO2 AQI Category` : AQI category of Nitrogen Dioxide of the city
- `PM2.5 AQI Value` : AQI value of Particulate Matter with a diameter of 2.5 micrometers or less of the city
- `PM2.5 AQI Category` : AQI category of Particulate Matter with a diameter of 2.5 micrometers or less of the city

Data Source Credit: [elichens](https://www.elichens.com/)
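
In the sample rows displayed in this section, the overall `AQI Value` coincides with the largest of the four pollutant AQI values. The sketch below checks this on the first five displayed rows; the column name `max_pollutant_aqi` is made up for the illustration, and whether the relationship holds for every row is an assumption to verify on the full data set.

```python
import pandas as pd

pollutant_cols = ["CO AQI Value", "Ozone AQI Value",
                  "NO2 AQI Value", "PM2.5 AQI Value"]

# Values copied from rows 0 to 4 of the displayed table.
sample = pd.DataFrame({
    "City": ["Praskoveya", "Presidente Dutra", "Priolo Gargallo",
             "Przasnysz", "Punaauia"],
    "AQI Value": [51, 41, 66, 34, 22],
    "CO AQI Value": [1, 1, 1, 1, 0],
    "Ozone AQI Value": [36, 5, 39, 34, 22],
    "NO2 AQI Value": [0, 1, 2, 0, 0],
    "PM2.5 AQI Value": [51, 41, 66, 20, 6],
})

# Largest pollutant-specific AQI per city, compared with the overall AQI.
sample["max_pollutant_aqi"] = sample[pollutant_cols].max(axis=1)
matches = (sample["max_pollutant_aqi"] == sample["AQI Value"]).all()

print(sample[["City", "AQI Value", "max_pollutant_aqi"]])
print(matches)
```

If the relationship does hold throughout, a model that sees all four pollutant values can reproduce the target almost exactly, which is worth considering when choosing features and interpreting scores.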

###**Prediction**

You are tasked to build an ML model to predict the AQI value.

In [None]:
import pandas as pd
# Assumes air_pollution.csv has already been uploaded to the Colab session under /content
data="/content/air_pollution.csv"
df = pd.read_csv(data)
df

Unnamed: 0,Country,City,AQI Value,AQI Category,CO AQI Value,CO AQI Category,Ozone AQI Value,Ozone AQI Category,NO2 AQI Value,NO2 AQI Category,PM2.5 AQI Value,PM2.5 AQI Category
0,Russian Federation,Praskoveya,51,Moderate,1,Good,36,Good,0,Good,51,Moderate
1,Brazil,Presidente Dutra,41,Good,1,Good,5,Good,1,Good,41,Good
2,Italy,Priolo Gargallo,66,Moderate,1,Good,39,Good,2,Good,66,Moderate
3,Poland,Przasnysz,34,Good,1,Good,34,Good,0,Good,20,Good
4,France,Punaauia,22,Good,0,Good,22,Good,0,Good,6,Good
...,...,...,...,...,...,...,...,...,...,...,...,...
140794,United States of America,Renton,55,Moderate,1,Good,28,Good,9,Good,55,Moderate
140795,India,Rewari,177,Unhealthy,3,Good,154,Unhealthy,1,Good,177,Unhealthy
140796,Brazil,Rio Negrinho,38,Good,0,Good,15,Good,2,Good,38,Good
140797,France,Riom,53,Moderate,1,Good,33,Good,1,Good,53,Moderate
