# Introduction

I have been hired by an organization that strives to improve educational outcomes for children and young people in Chicago. My job is to analyze the census, crime, and school data for a given neighborhood or district. 

I will identify factors that affect the enrollment, safety, health, and environment ratings of schools.

## Selected Socioeconomic Indicators in Chicago

The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal.
This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.

Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.

A detailed description of the dataset can be found on [the city of Chicago's website](https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDB0201ENSkillsNetwork20127838-2022-01-01), but to summarize, the dataset has the following variables:

*   **Community Area Number** (`ca`): Used to uniquely identify each row of the dataset

*   **Community Area Name** (`community_area_name`): The name of the region in the city of Chicago

*   **Percent of Housing Crowded** (`percent_of_housing_crowded`): Percent of occupied housing units with more than one person per room

*   **Percent Households Below Poverty** (`percent_households_below_poverty`): Percent of households living below the federal poverty line

*   **Percent Aged 16+ Unemployed** (`percent_aged_16_unemployed`): Percent of persons over the age of 16 years that are unemployed

*   **Percent Aged 25+ without High School Diploma** (`percent_aged_25_without_high_school_diploma`): Percent of persons over the age of 25 years without a high school education

*   **Percent Aged Under 18 or Over 64** (`percent_aged_under_18_or_over_64`): Percent of population under 18 or over 64 years of age (i.e., dependents)

*   **Per Capita Income** (`per_capita_income_`): Community Area per capita income is estimated as the sum of tract-level aggregate incomes divided by the total population

*   **Hardship Index** (`hardship_index`): Score that incorporates each of the six selected socioeconomic indicators
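The variable list above maps naturally onto a table definition. The following is a minimal sketch of such a schema in SQLite; the column names come from the list, the table name is the one used later in this notebook, and the SQL types are assumptions inferred from the kind of value each column holds.

```python
import sqlite3

# Column names come from the dataset description; the SQL types are
# assumptions, not something the data portal specifies.
SCHEMA = """
CREATE TABLE chicago_socioeconomic_data (
    ca                                          INTEGER,
    community_area_name                         TEXT,
    percent_of_housing_crowded                  REAL,
    percent_households_below_poverty            REAL,
    percent_aged_16_unemployed                  REAL,
    percent_aged_25_without_high_school_diploma REAL,
    percent_aged_under_18_or_over_64            REAL,
    per_capita_income_                          INTEGER,
    hardship_index                              REAL
)
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(SCHEMA)

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(chicago_socioeconomic_data)")]
print(columns)
conn.close()
```

Declaring the types up front is optional here, since pandas infers them when it writes the table, but it documents what each column is expected to contain.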




# Method

First, I created all three tables in the PostgreSQL database.

Then I inserted the data to fill the tables.

The tables are now ready to be analysed.

### Connect to the database

Let us first load the SQL extension and establish a connection with the database.

##### The syntax for connecting to SQL magic using SQLite is

**%sql sqlite:///DatabaseName**

where DatabaseName is the path to your **.db** file
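The connection string follows the SQLAlchemy URL convention, where a relative file path comes after three slashes and an absolute path after four. As a small illustration, the helper below (`sqlite_url` is a hypothetical name, not part of any library) builds such a URL from a file path.

```python
from pathlib import Path


def sqlite_url(db_path: str) -> str:
    """Build a SQLAlchemy-style SQLite URL (hypothetical helper)."""
    # An absolute POSIX path already starts with "/", which gives four
    # slashes in total; a relative path gives three.
    return f"sqlite:///{Path(db_path).as_posix()}"


relative_url = sqlite_url("socioeconomic.db")
print(relative_url)
```

The resulting string is what would follow `%sql` in a notebook cell.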


In [1]:
%load_ext sql

In [20]:
import sqlite3
import pandas as pd

# Open (or create) the SQLite database file and get a cursor
conn = sqlite3.connect("socioeconomic.db")
cur = conn.cursor()


In [21]:
%sql sqlite:///socioeconomic.db

I will load the CSV file into a SQLite table.


In [22]:
# Read the CSV from the data portal, then write it to SQLite (replacing any existing table)
df = pd.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')
df.to_sql("chicago_socioeconomic_data", conn, if_exists='replace', index=False, method="multi")


78
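For comparison, the same load step can be sketched with only the standard library. This is an illustration rather than the method used in this notebook: three sample rows copied from the dataset stand in for the full CSV file so the example runs without network access.

```python
import csv
import io
import sqlite3

# Three sample rows from the dataset, standing in for the downloaded file.
SAMPLE_CSV = """ca,community_area_name,percent_of_housing_crowded,percent_households_below_poverty,percent_aged_16_unemployed,percent_aged_25_without_high_school_diploma,percent_aged_under_18_or_over_64,per_capita_income_,hardship_index
1,Rogers Park,7.7,23.6,8.7,18.2,27.5,23939,39
2,West Ridge,7.8,17.2,8.8,20.8,38.5,23040,46
3,Uptown,3.8,24.0,8.9,11.8,22.2,35787,20
"""

reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)  # first line holds the column names
rows = list(reader)

conn = sqlite3.connect(":memory:")
# No column types are declared, so SQLite stores each value as the text
# that was read from the file.
conn.execute(f"CREATE TABLE chicago_socioeconomic_data ({', '.join(header)})")
placeholders = ", ".join("?" for _ in header)
conn.executemany(f"INSERT INTO chicago_socioeconomic_data VALUES ({placeholders})", rows)
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM chicago_socioeconomic_data").fetchone()[0]
print(loaded)
conn.close()
```

Because this version stores text, numeric comparisons would need declared column types or a `CAST`. The pandas route avoids that by inferring the types before writing.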

I will check that the data loaded correctly:

In [23]:
%sql SELECT * FROM chicago_socioeconomic_data limit 5;

 * sqlite:///socioeconomic.db
Done.


| ca | community_area_name | percent_of_housing_crowded | percent_households_below_poverty | percent_aged_16_unemployed | percent_aged_25_without_high_school_diploma | percent_aged_under_18_or_over_64 | per_capita_income_ | hardship_index |
|---|---|---|---|---|---|---|---|---|
| 1.0 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 |
| 2.0 | West Ridge | 7.8 | 17.2 | 8.8 | 20.8 | 38.5 | 23040 | 46.0 |
| 3.0 | Uptown | 3.8 | 24.0 | 8.9 | 11.8 | 22.2 | 35787 | 20.0 |
| 4.0 | Lincoln Square | 3.4 | 10.9 | 8.2 | 13.4 | 25.5 | 37524 | 17.0 |
| 5.0 | North Center | 0.3 | 7.5 | 5.2 | 4.5 | 26.2 | 57123 | 6.0 |


## Problems

### Problem 1

##### How many rows are in the dataset?


In [24]:
df=pd.read_sql_query("SELECT * FROM chicago_socioeconomic_data", conn)
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 9 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   ca                                           77 non-null     float64
 1   community_area_name                          78 non-null     object 
 2   percent_of_housing_crowded                   78 non-null     float64
 3   percent_households_below_poverty             78 non-null     float64
 4   percent_aged_16_unemployed                   78 non-null     float64
 5   percent_aged_25_without_high_school_diploma  78 non-null     float64
 6   percent_aged_under_18_or_over_64             78 non-null     float64
 7   per_capita_income_                           78 non-null     int64  
 8   hardship_index                               77 non-null     float64
dtypes: float64(7), int64(1), object(1)
memory usage: 5.6+ KB
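The `info()` output reports 78 entries but only 77 non-null values for `ca` and `hardship_index`, so it matters whether a count includes rows with missing values. The sketch below shows the difference between `COUNT(*)` and `COUNT(column)` on a small in-memory table. Five areas are taken from the sample rows shown earlier, and one row with a missing hardship index is a hypothetical placeholder added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chicago_socioeconomic_data (community_area_name TEXT, hardship_index REAL)"
)
# Five areas from the sample output plus one hypothetical row whose
# hardship index is missing (NULL).
conn.executemany(
    "INSERT INTO chicago_socioeconomic_data VALUES (?, ?)",
    [("Rogers Park", 39.0), ("West Ridge", 46.0), ("Uptown", 20.0),
     ("Lincoln Square", 17.0), ("North Center", 6.0), ("Hypothetical area", None)],
)

# COUNT(*) counts rows; COUNT(column) skips rows where that column is NULL.
total_rows, rows_with_index = conn.execute(
    "SELECT COUNT(*), COUNT(hardship_index) FROM chicago_socioeconomic_data"
).fetchone()
print(total_rows, rows_with_index)
conn.close()
```

Run against the full table, the same query would be a direct way to answer the row-count question without loading every row into a DataFrame.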


### Problem 2

##### How many community areas in Chicago have a hardship index greater than 50.0?


In [27]:
df = pd.read_sql_query("""
    SELECT COUNT(DISTINCT community_area_name) AS Number_communityareas_hardship_index_greater_50
    FROM chicago_socioeconomic_data
    WHERE hardship_index > 50
""", conn)
df


|   | Number_communityareas_hardship_index_greater_50 |
|---|---|
| 0 | 38 |
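The threshold in this query is written directly into the SQL text. As a possible variation, the sketch below passes it as a bound parameter so the same query can be reused for other cut-offs. The function name `count_above` is hypothetical, the three named areas and their scores come from values shown in this notebook, and the row with a missing score is a placeholder.

```python
import sqlite3


def count_above(conn: sqlite3.Connection, threshold: float) -> int:
    """Count rows whose hardship index is strictly greater than threshold."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM chicago_socioeconomic_data WHERE hardship_index > ?",
        (threshold,),
    ).fetchone()
    return count


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chicago_socioeconomic_data (community_area_name TEXT, hardship_index REAL)"
)
conn.executemany(
    "INSERT INTO chicago_socioeconomic_data VALUES (?, ?)",
    [("Riverdale", 98.0), ("West Ridge", 46.0), ("Rogers Park", 39.0),
     ("Hypothetical area", None)],  # NULL never satisfies a > comparison
)

above_50 = count_above(conn, 50.0)
above_40 = count_above(conn, 40.0)
print(above_50, above_40)
conn.close()
```

Rows with a missing hardship index are excluded automatically, because a comparison against NULL is never true.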


### Problem 3

##### What is the maximum value of hardship index in this dataset?


In [28]:
df=pd.read_sql_query("SELECT MAX(hardship_index) as Max_hardship_index FROM chicago_socioeconomic_data", conn)
df

|   | Max_hardship_index |
|---|---|
| 0 | 98.0 |


### Problem 4

##### Which community area has the highest hardship index?


In [30]:
df = pd.read_sql_query("""
    SELECT community_area_name, MAX(hardship_index)
    FROM chicago_socioeconomic_data
    WHERE hardship_index IS NOT NULL
    GROUP BY community_area_name
    ORDER BY MAX(hardship_index) DESC
    LIMIT 1
""", conn)
df

|   | community_area_name | MAX(hardship_index) |
|---|---|---|
| 0 | Riverdale | 98.0 |
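An alternative way to phrase this question is a subquery that selects every row whose hardship index equals the maximum. The sketch below is one such formulation, not the query used above; it runs on a small in-memory table filled with scores that appear in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chicago_socioeconomic_data (community_area_name TEXT, hardship_index REAL)"
)
conn.executemany(
    "INSERT INTO chicago_socioeconomic_data VALUES (?, ?)",
    [("Rogers Park", 39.0), ("West Ridge", 46.0), ("Uptown", 20.0), ("Riverdale", 98.0)],
)

# The inner query finds the maximum; the outer query returns every area
# that matches it, so ties would all be listed.
top_areas = conn.execute(
    """
    SELECT community_area_name, hardship_index
    FROM chicago_socioeconomic_data
    WHERE hardship_index = (SELECT MAX(hardship_index) FROM chicago_socioeconomic_data)
    """
).fetchall()
print(top_areas)
conn.close()
```

Unlike `ORDER BY ... LIMIT 1`, this form would return more than one row if several community areas shared the highest score.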


### Problem 5

##### Which Chicago community areas have per-capita incomes greater than $60,000?


In [32]:
df = pd.read_sql_query("""
    SELECT community_area_name, MAX(per_capita_income_)
    FROM chicago_socioeconomic_data
    WHERE per_capita_income_ IS NOT NULL AND per_capita_income_ > 60000
    GROUP BY community_area_name
    ORDER BY MAX(per_capita_income_) DESC
""", conn)
df


|   | community_area_name | MAX(per_capita_income_) |
|---|---|---|
| 0 | Near North Side | 88669 |
| 1 | Lincoln Park | 71551 |
| 2 | Loop | 65526 |
| 3 | Lake View | 60058 |
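Since each community area appears once in the table, the same list can probably be produced without `GROUP BY` or `MAX`. The following sketch assumes one row per area and filters with a plain `WHERE` clause; the income figures are taken from values shown in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chicago_socioeconomic_data (community_area_name TEXT, per_capita_income_ INTEGER)"
)
conn.executemany(
    "INSERT INTO chicago_socioeconomic_data VALUES (?, ?)",
    [("Rogers Park", 23939), ("North Center", 57123), ("Lake View", 60058),
     ("Loop", 65526), ("Lincoln Park", 71551), ("Near North Side", 88669)],
)

# Assumes one row per community area, so no aggregation is needed.
income_threshold = 60000
high_income = conn.execute(
    """
    SELECT community_area_name, per_capita_income_
    FROM chicago_socioeconomic_data
    WHERE per_capita_income_ > ?
    ORDER BY per_capita_income_ DESC
    """,
    (income_threshold,),
).fetchall()
print([name for name, _ in high_income])
conn.close()
```

The `IS NOT NULL` check is also unnecessary in this form, because a NULL income never satisfies the `>` comparison.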


In [19]:
conn.close()
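Closing the connection in a final cell works as long as every earlier cell runs without error. In a plain script, one option is to wrap the connection in `contextlib.closing` so it is closed even if a query raises. The sketch below uses a throwaway in-memory table `t` purely for illustration.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, even on an exception.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    value = conn.execute("SELECT x FROM t").fetchone()[0]

# The connection is closed now, so further use raises ProgrammingError.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
print(value, still_open)
```

Note that using `sqlite3.connect(...)` directly as a context manager only manages transactions (commit or rollback); it does not close the connection.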