# How to ask questions of your data

1. Look at your data.
  - What features does it have? (columns in a table, properties of objects etc.)
  - Where did it come from? (Who produced it and why?)
  - How extensive is it? (Does it seem probable that the dataset is large enough to represent reality to some degree?)
2. Think about what questions you could ask based on the above.
  - What are the most immediate questions that come to mind when you look at the data?
  - Does it seem interesting? (Otherwise find different data!)
  - What do the features give you access to?
  - Can you group or sort the data to make it more meaningful?
  - Can the data features be aggregated to make them more meaningful?
3. Do you need to clean your data?
  - Are there missing features in your samples? (NaN, null, empty strings etc.)
  - Are some features a collection of values? (e.g. comma-separated strings)
  - Are some features not numeric? (If so, are they categories to group the data by, or are they categorical in nature so that they can be represented by a number?)
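
The checks in step 3 can be sketched in pandas roughly as follows. This is a minimal sketch on a small made-up table: the column names (`city`, `tags`, `size`, `price`), the values and the `size_order` mapping are all invented for illustration and are not part of the dataset used in the example below.

```python
import pandas as pd

# A small invented table with typical problems: a missing value,
# a comma-separated feature and non-numeric features.
toy = pd.DataFrame({
    "city": ["Aarhus", "Odense", "Aarhus", "Aalborg"],
    "tags": ["a,b", "b", "", "a,c"],
    "size": ["small", "large", "medium", "small"],
    "price": [10.0, None, 12.5, 9.0],
})

# Missing features: count NaN/None per column, and empty strings separately
missing_per_column = toy.isna().sum()
empty_tags = (toy["tags"] == "").sum()

# Collection features: split the comma-separated string into a list
toy["tag_list"] = toy["tags"].apply(lambda s: s.split(",") if s else [])

# Non-numeric features: "city" is a category to group the data by ...
mean_price_per_city = toy.groupby("city")["price"].mean()

# ... while "size" has a natural order and can be represented by a number
size_order = {"small": 0, "medium": 1, "large": 2}  # hypothetical mapping
toy["size_code"] = toy["size"].map(size_order)

print(missing_per_column)
print(mean_price_per_city)
```

Whether a text feature is better treated as a grouping key or encoded as a number depends on the question you want to ask, so the split shown here is only one possible choice.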


# Example
[WHAT THE DATA MEANS](https://github.com/owid/poverty-data/blob/main/datasets/pip_codebook.csv)


In [1]:
import requests
url = "https://github.com/owid/poverty-data/raw/main/datasets/pip_dataset.csv"
response = requests.get(url)
response.raise_for_status() # stop here if the download failed
csv_content = response.text


In [2]:
import pandas as pd
from io import StringIO # To be able to treat our string value as a file
io_file = StringIO(csv_content)
df = pd.read_csv(io_file, sep=",")
df

Unnamed: 0,country,year,reporting_level,welfare_type,ppp_version,survey_year,survey_comparability,headcount_ratio_international_povline,headcount_ratio_lower_mid_income_povline,headcount_ratio_upper_mid_income_povline,...,decile8_thr,decile9_thr,gini,mld,polarization,palma_ratio,s80_s20_ratio,p90_p10_ratio,p90_p50_ratio,p50_p10_ratio
0,Albania,1996,national,consumption,2011,1996.0,0.0,0.920669,11.174149,44.618417,...,8.85,10.92,0.270103,0.119104,0.241293,0.928335,3.945872,3.568627,1.889273,1.888889
1,Albania,2002,national,consumption,2011,2002.0,1.0,1.570843,14.132118,49.669635,...,8.83,11.58,0.317390,0.164812,0.268982,1.215056,4.831625,3.979381,2.090253,1.903780
2,Albania,2005,national,consumption,2011,2005.0,1.0,0.860527,8.715685,38.545254,...,10.02,12.78,0.305957,0.154413,0.254529,1.142718,4.662236,3.872727,1.978328,1.957576
3,Albania,2008,national,consumption,2011,2008.0,1.0,0.313650,5.250542,31.110345,...,10.74,13.62,0.299847,0.148893,0.247311,1.114657,4.395911,3.574803,1.956897,1.826772
4,Albania,2012,national,consumption,2011,2012.0,1.0,0.849754,6.182414,34.528906,...,10.52,13.26,0.289605,0.138417,0.249988,1.041193,4.272573,3.632877,1.941435,1.871233
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4872,Zambia,2010,national,consumption,2017,2010.0,4.0,68.456606,82.885088,93.044964,...,3.27,5.34,0.556215,0.539135,0.536149,4.278696,16.024650,10.470588,3.955556,2.647059
4873,Zambia,2015,national,consumption,2017,2015.0,4.0,61.352160,77.548045,90.747141,...,4.02,6.43,0.571361,0.604667,0.613181,4.995829,21.243915,13.978261,4.095541,3.413043
4874,Zimbabwe,2011,national,consumption,2017,2011.0,0.0,21.580063,47.857143,77.945113,...,7.25,10.54,0.431536,0.311153,0.388356,2.207938,8.526847,6.713376,2.773684,2.420382
4875,Zimbabwe,2017,national,consumption,2017,2017.0,0.0,34.206046,61.583570,84.109019,...,5.96,8.81,0.443371,0.322848,0.416694,2.307359,8.569542,6.574627,3.069686,2.141791
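
The `Unnamed: 0` column in the output above usually appears when a CSV file stores the row index as a first column without a header. Assuming that is the case here, one possible way to avoid it is the `index_col` parameter of `pd.read_csv`. The sketch below uses a tiny inline CSV (`sample_csv`, built from a few values of the first two rows shown above) instead of the real download, so it runs without a network connection.

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the downloaded text: the first column has no header,
# which is what produces "Unnamed: 0" when read with the defaults.
sample_csv = ",country,year,gini\n0,Albania,1996,0.270103\n1,Albania,2002,0.317390\n"

# Default behaviour: the unnamed first column becomes a regular column
with_default = pd.read_csv(StringIO(sample_csv))

# index_col=0: the first column is used as the row index instead
with_index = pd.read_csv(StringIO(sample_csv), index_col=0)

print(list(with_default.columns))
print(list(with_index.columns))
```

`pd.read_csv` also accepts a URL directly, so `pd.read_csv(url, index_col=0)` would be a shorter alternative to downloading the text with `requests` first, if the extra control over the request is not needed.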


In [3]:
df.head(10) # show the first 10 rows

Unnamed: 0,country,year,reporting_level,welfare_type,ppp_version,survey_year,survey_comparability,headcount_ratio_international_povline,headcount_ratio_lower_mid_income_povline,headcount_ratio_upper_mid_income_povline,...,decile8_thr,decile9_thr,gini,mld,polarization,palma_ratio,s80_s20_ratio,p90_p10_ratio,p90_p50_ratio,p50_p10_ratio
0,Albania,1996,national,consumption,2011,1996.0,0.0,0.920669,11.174149,44.618417,...,8.85,10.92,0.270103,0.119104,0.241293,0.928335,3.945872,3.568627,1.889273,1.888889
1,Albania,2002,national,consumption,2011,2002.0,1.0,1.570843,14.132118,49.669635,...,8.83,11.58,0.31739,0.164812,0.268982,1.215056,4.831625,3.979381,2.090253,1.90378
2,Albania,2005,national,consumption,2011,2005.0,1.0,0.860527,8.715685,38.545254,...,10.02,12.78,0.305957,0.154413,0.254529,1.142718,4.662236,3.872727,1.978328,1.957576
3,Albania,2008,national,consumption,2011,2008.0,1.0,0.31365,5.250542,31.110345,...,10.74,13.62,0.299847,0.148893,0.247311,1.114657,4.395911,3.574803,1.956897,1.826772
4,Albania,2012,national,consumption,2011,2012.0,1.0,0.849754,6.182414,34.528906,...,10.52,13.26,0.289605,0.138417,0.249988,1.041193,4.272573,3.632877,1.941435,1.871233
5,Albania,2014,national,consumption,2011,2014.0,2.0,1.580897,11.615621,37.033842,...,11.74,15.78,0.345989,0.198662,0.32431,1.376215,5.930924,5.17377,2.296943,2.252459
6,Albania,2015,national,consumption,2011,2015.0,2.0,0.245098,4.690767,24.461185,...,13.8,18.26,0.327537,0.175495,0.300364,1.248506,5.278241,4.599496,2.229548,2.062972
7,Albania,2016,national,consumption,2011,2016.0,2.0,0.410364,5.462612,23.903491,...,14.78,19.18,0.337363,0.187581,0.31438,1.306972,5.62688,4.930591,2.251174,2.190231
8,Albania,2016,national,income,2011,2016.0,4.0,6.681604,17.621649,39.949891,...,11.88,16.02,0.385656,0.27374,0.35036,1.696002,8.286575,6.965217,2.376855,2.930435
9,Albania,2017,national,consumption,2011,2017.0,2.0,0.428475,4.284743,23.828625,...,14.54,18.82,0.330557,0.179908,0.308522,1.259837,5.389502,4.635468,2.229858,2.078818


In [None]:
# Filters: boolean masks that select the rows we want
only_denmark = (df['country'] == 'Denmark')
only_ppp_2017 = (df['ppp_version'] == 2017) # values expressed in 2017 PPP prices
# Keep the matching rows and the two columns we need; .copy() avoids modifying a view of df
denmark = df.loc[only_denmark & only_ppp_2017, ['year', 'headcount_ratio_international_povline']].copy()
denmark.columns = ['year', 'poor in %'] # rename columns
denmark = denmark.set_index('year').sort_index() # set year as a sorted index
print('Percentage of the Danish population living below the international poverty line:')

# Slice the denmark dataframe to only rows with index between 2005 and 2015
denmark.loc[2005:2015]

# Possible questions
- Has global poverty declined or grown from 1996 to 2019?
- What share of people in Denmark live on less than $10 a day?
- How much has it changed over 30 years?
- What is the country with the highest
- Which country has the largest population of poor people (living below one USD a day)?
- How has poverty changed in Eastern Europe over 30 years?
- And many many more ...
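
As an illustration of how a question of this kind turns into code, the first one could be approached by grouping on `year` and aggregating. This is only a sketch: the rows below are invented (the countries `A` and `B` and their values are hypothetical), and only the column names match the real dataset. Note that a plain mean over countries is not a population-weighted global poverty rate, so it answers a simplified version of the question.

```python
import pandas as pd

# Invented values; only the column names match the real dataset.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1996, 2019, 1996, 2019],
    "headcount_ratio_international_povline": [40.0, 10.0, 20.0, 10.0],
})

# Average share of poor people per year across countries
# (unweighted: every country counts the same regardless of population)
per_year = toy.groupby("year")["headcount_ratio_international_povline"].mean()

# Change in percentage points between the first and the last year
change = per_year.loc[2019] - per_year.loc[1996]

print(per_year)
print(f"Change 1996-2019: {change:+.1f} percentage points")
```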

## Strategy
1. Is data missing in any of my essential rows?
2. Can I ensure that the data is comparable across columns and rows?
3. Do I need to remove rows or columns that contain missing or invalid data?
4. How can I slice the data to get only what is relevant for each question?
  - For the second question I only need Denmark, which is rows 2961-2981.
  - And only the column that represents the poverty data.
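
The strategy steps above could look roughly like this in pandas. It is a minimal sketch on invented rows (only the column names are taken from the real dataset), and the `essential` list is a hypothetical choice of which columns matter for the Denmark question. Selecting rows by value rather than by row number keeps the code working if the row order of the file changes.

```python
import pandas as pd

# Invented values; only the column names match the real dataset.
toy = pd.DataFrame({
    "country": ["Denmark", "Denmark", "Sweden", "Denmark"],
    "year": [2005, 2010, 2010, 2015],
    "ppp_version": [2017, 2017, 2017, 2017],
    "headcount_ratio_international_povline": [0.2, None, 0.3, 0.1],
})
essential = ["country", "year", "headcount_ratio_international_povline"]

# 1. Is data missing in any essential column?
missing = toy[essential].isna().sum()

# 2. Keep a single ppp_version so that values are comparable
comparable = toy[toy["ppp_version"] == 2017]

# 3. Drop rows where an essential value is missing
clean = comparable.dropna(subset=essential)

# 4. Select rows by value and keep only the relevant columns
denmark = clean.loc[clean["country"] == "Denmark", essential]

print(missing)
print(denmark)
```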

# Class exercise
1. Find some data online that you think is interesting
2. Download and examine it carefully
3. Ask 5 questions of the data
4. Lay out a plan for how you can answer the questions
5. Implement your plan