<a href="https://colab.research.google.com/github/tomfox1/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd
import numpy as np

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [244]:
#merging data
phone_pop = pd.merge(cell_phones, population, how = "inner", on = ["geo", "time"])
phone_pop.shape

(8590, 4)
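
As a minimal sketch of what the inner join does, the toy frames below (made-up values, not Gapminder data) share only some `(geo, time)` pairs. The `validate` argument is an optional safeguard added here for illustration; it is not part of the solution above.

```python
import pandas as pd

# Toy frames; all values are illustrative only
phones = pd.DataFrame({
    "geo": ["aaa", "aaa", "bbb"],
    "time": [2000, 2001, 2000],
    "cell_phones_total": [10.0, 20.0, 5.0],
})
people = pd.DataFrame({
    "geo": ["aaa", "aaa", "ccc"],
    "time": [2000, 2001, 2000],
    "population_total": [100, 110, 50],
})

# An inner join keeps only the (geo, time) pairs present in both frames.
# validate="one_to_one" raises an error if either side has duplicate keys.
joined = pd.merge(phones, people, how="inner", on=["geo", "time"],
                  validate="one_to_one")
print(joined.shape)
```

Rows for `bbb` and `ccc` are dropped because each appears in only one of the two frames.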

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [245]:
#merging data 
geo_pop = geo_country_codes[["geo", "country"]].merge(phone_pop)
geo_pop.shape

(8590, 5)
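
The merge above relies on pandas finding the shared `geo` column by itself. A hedged sketch of the same idea with the key spelled out, on toy frames with made-up values (the `extra` column and the country names are hypothetical):

```python
import pandas as pd

# Toy lookup table with one row per geo code, plus a column we do not need
codes = pd.DataFrame({
    "geo": ["aaa", "bbb"],
    "country": ["Aland", "Bland"],
    "extra": [1, 2],
})
# Toy data table with several rows per geo code
data = pd.DataFrame({
    "geo": ["aaa", "aaa", "bbb"],
    "time": [2000, 2001, 2000],
    "cell_phones_total": [10.0, 20.0, 5.0],
    "population_total": [100, 110, 50],
})

# Select only the two lookup columns, then join on the explicit key.
# validate="one_to_many" checks that each geo code appears once in the lookup.
labelled = codes[["geo", "country"]].merge(
    data, how="inner", on="geo", validate="one_to_many"
)
print(labelled.shape)
```

Naming the key makes the join easier to read and guards against an accidental match on some other shared column.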

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [0]:
#easy way to calculate cell phone per person 
geo_pop["cell phones per person"] = geo_pop["cell_phones_total"] / geo_pop["population_total"]
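
A quick sanity check of the stated 1.220 figure, sketched on a two-row frame. The numbers are the 2017 rows for the United States and China as they appear in the outputs later in this notebook; the frame name `check` is introduced here only for illustration.

```python
import pandas as pd

# 2017 rows copied from this notebook's own outputs
check = pd.DataFrame({
    "country": ["United States", "China"],
    "time": [2017, 2017],
    "cell_phones_total": [395881000.0, 1474097000.0],
    "population_total": [324459463, 1409517397],
})

# Same element-wise division as in the cell above
check["cell phones per person"] = (
    check["cell_phones_total"] / check["population_total"]
)

# Pull out the single United States value for 2017
is_us_2017 = (check["country"] == "United States") & (check["time"] == 2017)
us_2017 = check.loc[is_us_2017, "cell phones per person"].iloc[0]
print(round(us_2017, 3))
```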

Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [247]:
#converting our lowercase strings in "geo" to uppercase
geo_pop["geo"] = geo_pop["geo"].str.upper()
geo_pop.head(5)

,geo,country,time,cell_phones_total,population_total,cell phones per person
0,AFG,Afghanistan,1960,0.0,8996351,0.0
1,AFG,Afghanistan,1965,0.0,9938414,0.0
2,AFG,Afghanistan,1970,0.0,11126123,0.0
3,AFG,Afghanistan,1975,0.0,12590286,0.0
4,AFG,Afghanistan,1976,0.0,12840299,0.0


## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [248]:
#first describing all data 
geo_pop.describe(include = "all")


,geo,country,time,cell_phones_total,population_total,cell phones per person
count,8590,8590,8590.0,8590.0,8590.0,8590.0
unique,195,195,,,,
top,CYP,Azerbaijan,,,,
freq,46,46,,,,
mean,,,1994.1934807916184,9004949.642905472,29838230.581722934,0.2796385558059151
std,,,14.257974607310302,55734084.87217964,116128377.474773,0.454246656214052
min,,,1960.0,0.0,4433.0,0.0
25%,,,1983.0,0.0,1456148.0,0.0
50%,,,1995.0,6200.0,5725062.5,0.0015636266438163
75%,,,2006.0,1697652.0,18105812.0,0.4611491855201403


In [249]:
#describing data excluding numeric columns
geo_pop.describe(exclude = [np.number])

,geo,country
count,8590,8590
unique,195,195
top,CYP,Azerbaijan
freq,46,46
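
For comparison, a small sketch that produces the numeric and non-numeric summaries as two separate calls, using `include` and `exclude` with `np.number`. The toy frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two text columns and two numeric columns
toy = pd.DataFrame({
    "geo": ["AAA", "AAA", "BBB"],
    "country": ["Aland", "Aland", "Bland"],
    "time": [2000, 2001, 2000],
    "cell_phones_total": [10.0, 20.0, 5.0],
})

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = toy.describe(include=[np.number])

# Everything except numeric columns: count, unique, top, freq
text_summary = toy.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

Splitting the two calls avoids the `NaN` cells that appear when `include="all"` mixes both kinds of statistics in one table.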


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [251]:
#it is important to create a condition for the year 2017 before using the groupby function to sort for "cell_phones_total"
condition = (geo_pop["time"] == 2017)
geo_2017 = geo_pop[condition]
geo_2017.groupby(["country"])["cell_phones_total"].sum().sort_values(ascending = False).head()



country
China           1,474,097,000.0
India           1,168,902,277.0
Indonesia         458,923,202.0
United States     395,881,000.0
Brazil            236,488,548.0
Name: cell_phones_total, dtype: float64
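
Because each country has a single row per year, the `groupby` and `sum` step is not strictly needed. A hedged alternative sketch using `nlargest`, on a small frame built from totals shown in this notebook's outputs (the frame name `rows` is introduced here for illustration):

```python
import pandas as pd

# 2017 totals for five countries, plus one 2007 row that the filter should drop
rows = pd.DataFrame({
    "country": ["Brazil", "China", "India", "Indonesia", "United States", "China"],
    "time": [2017, 2017, 2017, 2017, 2017, 2007],
    "cell_phones_total": [236488548.0, 1474097000.0, 1168902277.0,
                          458923202.0, 395881000.0, 547306000.0],
})

# Filter to 2017 first, then take the three largest totals
top3 = (
    rows[rows["time"] == 2017]
    .nlargest(3, "cell_phones_total")[["country", "cell_phones_total"]]
)
print(top3)
```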

2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [299]:
#The first year with more cell phones than people is the first year where phones per person >= 1
#we create a condition to select the US as our country of choice
#then we filter for values >= 1 and take the first row (this relies on the rows being ordered by year)
#which gives 2014, the first year where total cell phones surpassed population
countries = ["United States"]

condition = geo_pop['country'].isin(countries)

us = geo_pop[condition]

us[us["cell phones per person"] >= 1].head(1)


,geo,country,time,cell_phones_total,population_total,cell phones per person
8131,USA,United States,2014,355500000.0,317718779,1.118914031833164
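
Taking the first row with `head(1)` depends on the row order. A sketch of an order-independent variant that takes the minimum qualifying year instead, using only the three United States rows that appear in this notebook's outputs (2007, 2014, 2017), deliberately shuffled. It illustrates the technique on that subset and does not replace the check against the full data.

```python
import pandas as pd

# Three US rows from this notebook's outputs, intentionally out of order
us_rows = pd.DataFrame({
    "time": [2017, 2007, 2014],
    "cell_phones_total": [395881000.0, 249300000.0, 355500000.0],
    "population_total": [324459463, 300595175, 317718779],
})
us_rows["cell phones per person"] = (
    us_rows["cell_phones_total"] / us_rows["population_total"]
)

# Keep the years where phones per person >= 1, then take the earliest one
first_year = us_rows.loc[us_rows["cell phones per person"] >= 1, "time"].min()
print(first_year)
```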


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007-2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [300]:
#creating 2 conditions to reduce the scope of our data to our specifications
years = [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]
countries = ["China", "India", "United States", "Indonesia", "Brazil"]

condition = geo_pop['time'].isin(years)
condition2 = geo_pop['country'].isin(countries)

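#combine both masks in one step; filtering subset with a mask built on geo_pop triggers a pandas reindexing warning
subset = geo_pop[condition & condition2]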

  


In [337]:
#our pivot table looks right
sub_pivot = subset.pivot_table(index = "country", columns = "time", values = "cell_phones_total")
sub_pivot

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0


In [338]:
#checking the shape to confirm our findings
sub_pivot.shape

(5, 11)
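
A compact sketch of the same filter-then-pivot pattern, using `between` for the year range and a single combined mask. The values are taken from the pivot table output above, limited to three countries and three years; the names `long_df` and `wide` are introduced here for illustration.

```python
import pandas as pd

# Long-format rows copied from the pivot output above
long_df = pd.DataFrame({
    "country": ["China"] * 3 + ["India"] * 3 + ["Brazil"] * 3,
    "time": [2007, 2016, 2017] * 3,
    "cell_phones_total": [
        547306000.0, 1364934000.0, 1474097000.0,
        233620000.0, 1127809000.0, 1168902277.0,
        120980103.0, 244067356.0, 236488548.0,
    ],
})

# between() is inclusive on both ends by default
mask = long_df["time"].between(2016, 2017) & long_df["country"].isin(["China", "India"])

# One row per country, one column per year
wide = long_df[mask].pivot_table(
    index="country", columns="time", values="cell_phones_total"
)
print(wide.shape)
```

`pivot_table` averages duplicate entries by default; with one row per country and year, each cell is simply the original value.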

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

If you have the time and curiosity, what other questions can you ask and answer with this data?

In [311]:
#note: attempted to do bonus question, however, not finished

years1 = [2007, 2017]
countries1 = ["China", "India", "United States", "Indonesia", "Brazil"]

new_condition = geo_pop['time'].isin(years1)
new_condition2 = geo_pop['country'].isin(countries1)

#combine both masks in one step to avoid the pandas reindexing warning
new_subset = geo_pop[new_condition & new_condition2]

  


In [312]:
#conceptually, we would subtract the 2007 totals from the 2017 totals, store the difference in a new column, and sort it in descending order
new_subset

,geo,country,time,cell_phones_total,population_total,cell phones per person
1074,BRA,Brazil,2007,120980103.0,191026637,0.6333153580042348
1084,BRA,Brazil,2017,236488548.0,209288278,1.1299655683535224
1486,CHN,China,2007,547306000.0,1336800506,0.409414865975522
1496,CHN,China,2017,1474097000.0,1409517397,1.0458168186766978
3539,IDN,Indonesia,2007,93386881.0,232989141,0.4008207446886977
3549,IDN,Indonesia,2017,458923202.0,263991379,1.738402230172827
3585,IND,India,2007,233620000.0,1179681239,0.198036547735587
3595,IND,India,2017,1168902277.0,1339180127,0.8728491809526382
8124,USA,United States,2007,249300000.0,300595175,0.8293546295279024
8134,USA,United States,2017,395881000.0,324459463,1.2201246847283354


In [313]:
#We would need to calculate 5 values; they could be worked out by hand and then entered manually, but that is not the task at hand
#with more time I would have enjoyed calculating them programmatically
new_sub_pivot = new_subset.pivot_table(index = "time", columns = "country", values = "cell_phones_total")
new_sub_pivot

time,Brazil,China,India,Indonesia,United States
2007,120980103.0,547306000.0,233620000.0,93386881.0,249300000.0
2017,236488548.0,1474097000.0,1168902277.0,458923202.0,395881000.0
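
One possible way to finish the bonus, sketched under the assumption that the 2007 and 2017 totals in the table above are correct. The frame is rebuilt by hand from those values so the example runs on its own; the names `totals` and `increase` are introduced here for illustration.

```python
import pandas as pd

# 2007 and 2017 totals copied from the table above, one row per country
totals = pd.DataFrame(
    {
        2007: [120980103.0, 547306000.0, 233620000.0, 93386881.0, 249300000.0],
        2017: [236488548.0, 1474097000.0, 1168902277.0, 458923202.0, 395881000.0],
    },
    index=pd.Index(
        ["Brazil", "China", "India", "Indonesia", "United States"], name="country"
    ),
)

# Absolute increase from 2007 to 2017, largest first
increase = (totals[2017] - totals[2007]).sort_values(ascending=False)
print(increase)
```

With these numbers, India comes first with an increase of 935,282,277, which matches the figure in the bonus question.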
