
# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [2]:
cell_phones.head()

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0
1,abw,1965,0.0
2,abw,1970,0.0
3,abw,1975,0.0
4,abw,1976,0.0


In [3]:
population.head()

Unnamed: 0,geo,time,population_total
0,afg,1800,3280000
1,afg,1801,3280000
2,afg,1802,3280000
3,afg,1803,3280000
4,afg,1804,3280000


In [5]:
cell_phones_and_population = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])
cell_phones_and_population.head()

Unnamed: 0,geo,time,cell_phones_total,population_total
0,afg,1960,0.0,8996351
1,afg,1965,0.0,9938414
2,afg,1970,0.0,11126123
3,afg,1975,0.0,12590286
4,afg,1976,0.0,12840299


In [9]:
## check to see that final shape is (8590, 4) 

cell_phones.shape, population.shape, cell_phones_and_population.shape

((9215, 3), (59297, 3), (8590, 4))
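
The shape check above confirms the row count, but it does not say whether any `(geo, time)` pair was duplicated. As a minimal sketch, assuming each pair should appear at most once per dataframe, `pd.merge` can enforce this with its `validate` argument. The `*_toy` frames below are illustrative stand-ins built from rows shown in the outputs above, not the full dataset.

```python
import pandas as pd

# Toy frames (hypothetical names); values are rows shown in this notebook's outputs.
cell_phones_toy = pd.DataFrame({
    'geo': ['afg', 'afg', 'usa'],
    'time': [1960, 1965, 2017],
    'cell_phones_total': [0.0, 0.0, 395881000.0],
})
population_toy = pd.DataFrame({
    'geo': ['afg', 'afg', 'afg', 'usa'],
    'time': [1800, 1960, 1965, 2017],
    'population_total': [3280000, 8996351, 9938414, 324459463],
})

# validate='one_to_one' raises pandas.errors.MergeError if a (geo, time) key
# is duplicated in either frame, which would otherwise silently multiply rows.
merged_toy = pd.merge(cell_phones_toy, population_toy,
                      how='inner', on=['geo', 'time'],
                      validate='one_to_one')

# The 1800 population row has no cell phone match, so the inner join drops it.
print(merged_toy.shape)
```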

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [23]:
cp_pop_country = pd.merge(geo_country_codes[['geo', 'country']], cell_phones_and_population, how='inner', on='geo')
cp_pop_country.head()

Unnamed: 0,geo,country,time,cell_phones_total,population_total
0,afg,Afghanistan,1960,0.0,8996351
1,afg,Afghanistan,1965,0.0,9938414
2,afg,Afghanistan,1970,0.0,11126123
3,afg,Afghanistan,1975,0.0,12590286
4,afg,Afghanistan,1976,0.0,12840299


In [25]:
## check to see if the shape is still good
cp_pop_country.shape

(8590, 5)
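
Here the inner join kept all 8590 rows, so every `geo` code found a country name. If the shape had come out smaller, one way to find the unmatched codes would be a left join with `indicator=True`. The sketch below uses toy frames, and `'xxx'` is a made-up code that exists only for illustration.

```python
import pandas as pd

# Toy lookup table and data (hypothetical names); 'xxx' is a made-up geo code.
codes_toy = pd.DataFrame({'geo': ['afg', 'usa'],
                          'country': ['Afghanistan', 'United States']})
data_toy = pd.DataFrame({'geo': ['afg', 'usa', 'xxx'],
                         'time': [1960, 2017, 2017]})

# indicator=True adds a '_merge' column saying where each row's key was found.
checked = pd.merge(data_toy, codes_toy, how='left', on='geo', indicator=True)

# Rows marked 'left_only' have a geo code with no country name.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'geo'].unique()
print(list(unmatched))
```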

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [0]:
## cell phones per person = 'cell_phones_total' / population_total
cp_pop_country['cp_per_person'] = cp_pop_country['cell_phones_total'] / cp_pop_country['population_total']


In [35]:
## check to see if it worked
cp_pop_country[(cp_pop_country['country'] == 'United States') & (cp_pop_country['time'] == 2017)]

Unnamed: 0,geo,country,time,cell_phones_total,population_total,cp_per_person
8134,usa,United States,2017,395881000.0,324459463,1.220125
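
The same check can be written as an assertion instead of being read off the output. This is a small sketch on a one-row frame (the name `usa_2017` is an assumption) that copies the United States 2017 values shown above.

```python
import pandas as pd

# One-row stand-in for the full dataframe; values copied from the output above.
usa_2017 = pd.DataFrame({
    'country': ['United States'],
    'time': [2017],
    'cell_phones_total': [395881000.0],
    'population_total': [324459463],
})
usa_2017['cp_per_person'] = (usa_2017['cell_phones_total']
                             / usa_2017['population_total'])

# .loc with a boolean mask selects the row; .item() extracts the single value.
mask = (usa_2017['country'] == 'United States') & (usa_2017['time'] == 2017)
ratio = usa_2017.loc[mask, 'cp_per_person'].item()

# The challenge text expects 1.220 cell phones per person.
assert round(ratio, 3) == 1.22
```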


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [0]:
cp_pop_country['geo'] = cp_pop_country['geo'].str.upper()

In [40]:
cp_pop_country.tail()

Unnamed: 0,geo,country,time,cell_phones_total,population_total,cp_per_person
8585,ZWE,Zimbabwe,2013,13633167.0,15054506,0.905587
8586,ZWE,Zimbabwe,2014,11798652.0,15411675,0.765566
8587,ZWE,Zimbabwe,2015,12757410.0,15777451,0.808585
8588,ZWE,Zimbabwe,2016,12878926.0,16150362,0.797439
8589,ZWE,Zimbabwe,2017,14092104.0,16529904,0.852522


## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [41]:
## describe numeric columns

cp_pop_country.describe()

Unnamed: 0,time,cell_phones_total,population_total,cp_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0,0.279639
std,14.257975,55734080.0,116128400.0,0.454247
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.0,0.001564
75%,2006.0,1697652.0,18105810.0,0.461149
max,2017.0,1474097000.0,1409517000.0,2.490243


In [43]:
## describe non-numeric columns
cp_pop_country.describe(exclude='number')

Unnamed: 0,geo,country
count,8590,8590
unique,195,195
top,MDG,Rwanda
freq,46,46


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [52]:
## first I chose rows where the 'time' column was equal to 2017 (an integer, not a string)
## second I sorted the values by 'cell_phones_total' in descending order.
## third I picked the columns I wanted to show
## lastly I chose to display only five from the top of the list
cp_pop_country[cp_pop_country['time']==2017].sort_values(by='cell_phones_total', ascending=False)[['country', 'cell_phones_total']].head()

Unnamed: 0,country,cell_phones_total
1496,China,1474097000.0
3595,India,1168902277.0
3549,Indonesia,458923202.0
8134,United States,395881000.0
1084,Brazil,236488548.0
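
An alternative to sorting and taking `head()` is `DataFrame.nlargest`, which returns the top rows by a column directly. The sketch below uses a small frame (`rows_2017` is a hypothetical name) holding the five totals above plus Zimbabwe's 2017 total from the earlier `tail()` output.

```python
import pandas as pd

# 2017 totals copied from the outputs in this notebook.
rows_2017 = pd.DataFrame({
    'country': ['Brazil', 'China', 'India', 'Indonesia',
                'United States', 'Zimbabwe'],
    'cell_phones_total': [236488548.0, 1474097000.0, 1168902277.0,
                          458923202.0, 395881000.0, 14092104.0],
})

# nlargest(5, column) keeps the five rows with the largest values, sorted descending.
top5 = rows_2017.nlargest(5, 'cell_phones_total')
print(top5['country'].tolist())
```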


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [62]:
## want USA rows only
## want rows only where USA cell phones total is greater than population
## want to show data in format where 'year' is ascending

(cp_pop_country[(cp_pop_country['cell_phones_total'] > cp_pop_country['population_total'])
                & (cp_pop_country['geo'] == 'USA')].sort_values(by='time', ascending=True))

Unnamed: 0,geo,country,time,cell_phones_total,population_total,cp_per_person
8131,USA,United States,2014,355500000.0,317718779,1.118914031833164
8132,USA,United States,2015,382307000.0,319929162,1.1949739048796058
8133,USA,United States,2016,395881000.0,322179605,1.228758722948959
8134,USA,United States,2017,395881000.0,324459463,1.2201246847283354


In [64]:
## quick check to see that in 2013, USA did still have more people than cell phones

cp_pop_country[(cp_pop_country['time'] == 2013) & (cp_pop_country['geo'] == 'USA')]

Unnamed: 0,geo,country,time,cell_phones_total,population_total,cp_per_person
8130,USA,United States,2013,310698000.0,315536676,0.9846652501340288


**The first year the USA had more cell phones than people was 2014: the ratio was still 0.985 in 2013 and rose to 1.119 in 2014.**
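
The year can also be computed rather than read from the sorted table: filter to the rows where phones exceed population and take the minimum of `time`. A minimal sketch, using the five United States rows shown above (the names `usa` and `first_year_usa` are assumptions):

```python
import pandas as pd

# United States rows for 2013-2017, copied from the outputs above.
usa = pd.DataFrame({
    'time': [2013, 2014, 2015, 2016, 2017],
    'cell_phones_total': [310698000.0, 355500000.0, 382307000.0,
                          395881000.0, 395881000.0],
    'population_total': [315536676, 317718779, 319929162,
                         322179605, 324459463],
})

# Boolean mask: True for the years with more cell phones than people.
more_phones = usa['cell_phones_total'] > usa['population_total']

# The earliest such year is the answer.
first_year_usa = int(usa.loc[more_phones, 'time'].min())
print(first_year_usa)
```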

## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:
## make a df that keeps only the years from 2007 onward (2007 included)
cpc_year = cp_pop_country[cp_pop_country['time'] >= 2007]


In [0]:
## modify a df further by including only the countries that you are interested in

countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil']
cpc_year_country = cpc_year[cpc_year['country'].isin(countries)]


In [0]:
## create pivot table. index = countries, columns = time, values = cell phones total
cpc_pivot_table = cpc_year_country.pivot_table(index='country', columns='time', values='cell_phones_total')

In [72]:
## check to see correctness
cpc_pivot_table.shape

(5, 11)

In [73]:
cpc_pivot_table

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0
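
One design note: `pivot_table` aggregates with the mean by default, so duplicate `(country, time)` rows would be averaged without any warning. When each pair is expected to appear once, `DataFrame.pivot` is a stricter alternative because it raises a `ValueError` on duplicates. The sketch below is illustrative only and uses four values copied from the table above.

```python
import pandas as pd

# Long-format toy data; values copied from the pivot table above.
long_toy = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'China', 'China'],
    'time': [2007, 2017, 2007, 2017],
    'cell_phones_total': [120980103.0, 236488548.0,
                          547306000.0, 1474097000.0],
})

# pivot reshapes without aggregating; duplicate (country, time) pairs would raise.
wide_toy = long_toy.pivot(index='country', columns='time',
                          values='cell_phones_total')
print(wide_toy.shape)
```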


#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [0]:
## list of 2017 total cell phones
last_year = list(cpc_year_country[cpc_year_country['time'] == 2017]['cell_phones_total'])

In [0]:
## list of 2007 total cell phones
first_year = list(cpc_year_country[cpc_year_country['time'] == 2007]['cell_phones_total'])

In [92]:
## list comprehension that gives the differences of cell phones total
## from 2017 and 2007

dif = [last_year[i]-first_year[i] for i in range(5)]
dif
  

[115508445.0, 926791000.0, 365536321.0, 935282277.0, 146581000.0]

In [125]:
# made a dataframe with corresponding difference counts.. not the best way.
# The rows behind `dif` are ordered by geo code (BRA, CHN, IDN, IND, USA),
# so Indonesia comes BEFORE India: the third value is Indonesia's, the fourth is India's.

df = pd.DataFrame(
        {'Brazil': 115508445.0,
        'China': 926791000.0,
        'Indonesia': 365536321.0,
        'India': 935282277.0,
        'United States': 146581000.0},
        index=[1])
df.T.sort_values(by=1, ascending=False)[1]

India           935,282,277.0
China           926,791,000.0
Indonesia       365,536,321.0
United States   146,581,000.0
Brazil          115,508,445.0
Name: 1, dtype: float64

The solution is not the cleanest way to get to the table, but the answer is India: it gained 935,282,277 cell phones from 2007 to 2017 (1,168,902,277 - 233,620,000), the largest increase of the five countries.
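
A way to avoid pairing values with country names by hand is to subtract labelled columns, so each difference stays attached to its country. The sketch below rebuilds the 2007 and 2017 columns from the pivot table in Part 4 (the name `totals` is an assumption). In the notebook itself the equivalent would be `(cpc_pivot_table[2017] - cpc_pivot_table[2007]).sort_values(ascending=False)`.

```python
import pandas as pd

# 2007 and 2017 totals copied from the pivot table in Part 4.
totals = pd.DataFrame(
    {2007: [120980103.0, 547306000.0, 233620000.0, 93386881.0, 249300000.0],
     2017: [236488548.0, 1474097000.0, 1168902277.0, 458923202.0, 395881000.0]},
    index=pd.Index(['Brazil', 'China', 'India', 'Indonesia', 'United States'],
                   name='country'),
)

# Subtraction aligns on the country index, so no manual ordering is needed.
increase = (totals[2017] - totals[2007]).sort_values(ascending=False)
print(increase.index.tolist())
```

Because the subtraction aligns on the index rather than on list position, the result does not depend on the order in which the rows happen to be stored.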

If you have the time and curiosity, what other questions can you ask and answer with this data?