
# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [2]:
cell_phones.head()

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0
1,abw,1965,0.0
2,abw,1970,0.0
3,abw,1975,0.0
4,abw,1976,0.0


In [3]:
population.head()

Unnamed: 0,geo,time,population_total
0,afg,1800,3280000
1,afg,1801,3280000
2,afg,1802,3280000
3,afg,1803,3280000
4,afg,1804,3280000


In [5]:
#Let's merge the two tables (inner join on the shared keys geo and time)
jointtable = pd.merge(cell_phones[["cell_phones_total", 'geo', 'time']],
                      population[['geo', 'time', 'population_total']],
                      how='inner', on=['geo', 'time'])
jointtable.shape

(8590, 4)
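
To illustrate what an inner join on two keys keeps and drops, here is a minimal sketch on made-up toy frames. The frame names `phones` and `people` and all values are hypothetical, not taken from the Gapminder files.

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder files.
phones = pd.DataFrame({
    "geo": ["aaa", "aaa", "bbb"],
    "time": [2000, 2001, 2000],
    "cell_phones_total": [10.0, 20.0, 5.0],
})
people = pd.DataFrame({
    "geo": ["aaa", "aaa", "ccc"],
    "time": [2000, 2001, 2000],
    "population_total": [100, 110, 50],
})

# An inner join keeps only the (geo, time) pairs present in both frames,
# so "bbb" (no population row) and "ccc" (no phone row) are dropped.
joined = pd.merge(phones, people, how="inner", on=["geo", "time"])
print(joined.shape)
```

Spelling out `how` and `on` is optional here, since pandas defaults to an inner join on the shared column names, but it makes the intent visible to the reader.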

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [6]:
jointtable.head()

Unnamed: 0,cell_phones_total,geo,time,population_total
0,0.0,afg,1960,8996351
1,0.0,afg,1965,9938414
2,0.0,afg,1970,11126123
3,0.0,afg,1975,12590286
4,0.0,afg,1976,12840299


In [0]:
#Add the country names by joining on the geo code
jointtable2 = pd.merge(jointtable[["cell_phones_total", 'population_total', 'time', 'geo']],
                       geo_country_codes[['geo', 'country']],
                       how='inner', on='geo')

In [8]:
#Check the shape
jointtable2.shape

(8590, 5)
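
A join against a code table like this is a many-to-one lookup: many data rows per code, one name per code. The sketch below, on hypothetical toy frames (`facts` and `codes` are made-up names), uses the `validate` argument of `merge` so that pandas raises an error if the lookup side ever contains duplicate keys.

```python
import pandas as pd

# Hypothetical toy data: several rows per geo code on the left,
# exactly one row per geo code in the lookup table on the right.
facts = pd.DataFrame({
    "geo": ["aaa", "aaa", "bbb"],
    "time": [2000, 2001, 2000],
    "value": [1.0, 2.0, 3.0],
})
codes = pd.DataFrame({
    "geo": ["aaa", "bbb", "ccc"],
    "country": ["Country A", "Country B", "Country C"],
    "extra": [1, 2, 3],
})

# Select only the columns needed from the lookup table before joining.
# validate="many_to_one" checks that geo is unique on the right side.
labeled = facts.merge(codes[["geo", "country"]], how="inner", on="geo",
                      validate="many_to_one")
print(labeled.shape)
```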

In [9]:
print(jointtable2)

      cell_phones_total  population_total  time  geo      country
0                   0.0           8996351  1960  afg  Afghanistan
1                   0.0           9938414  1965  afg  Afghanistan
2                   0.0          11126123  1970  afg  Afghanistan
3                   0.0          12590286  1975  afg  Afghanistan
4                   0.0          12840299  1976  afg  Afghanistan
5                   0.0          13067538  1977  afg  Afghanistan
6                   0.0          13237734  1978  afg  Afghanistan
7                   0.0          13306695  1979  afg  Afghanistan
8                   0.0          13248370  1980  afg  Afghanistan
9                   0.0          13053954  1981  afg  Afghanistan
10                  0.0          12749645  1982  afg  Afghanistan
11                  0.0          12389269  1983  afg  Afghanistan
12                  0.0          12047115  1984  afg  Afghanistan
13                  0.0          11783050  1985  afg  Afghanistan
14        

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)
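
As a small sketch of the calculation and of spot-checking one value, the example below uses a hypothetical toy frame (the country names and numbers are invented). The column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the joined frame.
toy = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B"],
    "time": [2016, 2017, 2017],
    "cell_phones_total": [90.0, 120.0, 30.0],
    "population_total": [100, 100, 60],
})

# Element-wise division of two columns gives the per-person ratio.
toy["cell_phones_per_person"] = toy["cell_phones_total"] / toy["population_total"]

# Look up a single (country, year) value to sanity-check the new column.
mask = (toy["country"] == "Country A") & (toy["time"] == 2017)
ratio = toy.loc[mask, "cell_phones_per_person"].iloc[0]
print(round(ratio, 3))
```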

In [0]:
#Let's calculate number of cell phones per person
jointtable2['cell_phones_per_person'] = jointtable2["cell_phones_total"] / jointtable2['population_total']

In [119]:
print(jointtable2[jointtable2['country']=='United States'])

      cell_phones_total  population_total  time  geo        country  \
8092                0.0         186808228  1960  Usa  United States   
8093                0.0         199815540  1965  Usa  United States   
8094                0.0         209588150  1970  Usa  United States   
8095                0.0         219205296  1975  Usa  United States   
8096                0.0         221239215  1976  Usa  United States   
8097                0.0         223324042  1977  Usa  United States   
8098                0.0         225449657  1978  Usa  United States   
8099                0.0         227599878  1979  Usa  United States   
8100                0.0         229763052  1980  Usa  United States   
8101            91600.0         238573861  1984  Usa  United States   
8102           340213.0         240824120  1985  Usa  United States   
8103           681825.0         243098935  1986  Usa  United States   
8104          1230855.0         245402864  1987  Usa  United States   
8105  

Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [159]:
#Convert the geo codes to uppercase
jointtable2['geo'] = jointtable2['geo'].str.upper()
print(jointtable2.head())

   cell_phones_total  population_total  time  geo      country  \
0                0.0           8996351  1960  AFG  Afghanistan   
1                0.0           9938414  1965  AFG  Afghanistan   
2                0.0          11126123  1970  AFG  Afghanistan   
3                0.0          12590286  1975  AFG  Afghanistan   
4                0.0          12840299  1976  AFG  Afghanistan   

   cell_phones_per_person  
0                     0.0  
1                     0.0  
2                     0.0  
3                     0.0  
4                     0.0  
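
The `.str` accessor applies string methods element by element and leaves missing values missing. A minimal sketch on a hypothetical Series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical geo codes, including one missing value.
codes = pd.Series(["afg", "usa", None], name="geo")

# .str.upper() uppercases each string; the missing entry stays missing
# instead of raising an error.
upper = codes.str.upper()
print(upper.tolist())
```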


## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [162]:
jointtable2.describe()

Unnamed: 0,cell_phones_total,population_total,time,cell_phones_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,9004950.0,29838230.0,1994.193481,0.279639
std,55734080.0,116128400.0,14.257975,0.454247
min,0.0,4433.0,1960.0,0.0
25%,0.0,1456148.0,1983.0,0.0
50%,6200.0,5725062.0,1995.0,0.001564
75%,1697652.0,18105810.0,2006.0,0.461149
max,1474097000.0,1409517000.0,2017.0,2.490243


In [26]:
jointtable2['country'].describe()

count        8590
unique        195
top       Romania
freq           46
Name: country, dtype: object
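
Describing a single column works, but `describe` can also summarize all non-numeric columns at once through its `include` and `exclude` arguments. A sketch on a hypothetical toy frame (names and values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two text columns and two numeric columns.
toy = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B"],
    "geo": ["AAA", "AAA", "BBB"],
    "time": [2000, 2001, 2000],
    "value": [1.0, 2.0, 3.0],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe(include=[np.number])

# Everything except numeric columns: count, unique, top, freq.
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```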

In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [69]:
#Filter by year
f = jointtable2[jointtable2['time']==2017]
df = f[['country', 'cell_phones_total']]

#Sort the values, largest first
df = df.sort_values("cell_phones_total", ascending=False)
df.head(5)

Unnamed: 0,country,cell_phones_total
1496,China,1474097000.0
3595,India,1168902000.0
3549,Indonesia,458923200.0
8134,United States,395881000.0
1084,Brazil,236488500.0
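
An alternative to sorting and taking the head is `DataFrame.nlargest`, which returns the top rows by a column directly. The sketch below assumes a hypothetical toy frame; the country names and totals are invented.

```python
import pandas as pd

# Hypothetical toy data covering two years.
toy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D", "Country A"],
    "time": [2017, 2017, 2017, 2017, 2016],
    "cell_phones_total": [50.0, 300.0, 120.0, 10.0, 999.0],
})

# Filter to one year, keep the two columns of interest,
# then take the 2 rows with the largest totals.
top2 = (toy.loc[toy["time"] == 2017, ["country", "cell_phones_total"]]
           .nlargest(2, "cell_phones_total"))
print(top2)
```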


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?
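
One way to approach a "first year where X exceeds Y" question is to build a boolean mask and take the smallest year among the matching rows. This is a sketch on hypothetical toy numbers, not the real figures for any country.

```python
import pandas as pd

# Hypothetical toy data for a single country.
toy = pd.DataFrame({
    "time": [2012, 2013, 2014, 2015],
    "cell_phones_total": [90.0, 99.0, 105.0, 120.0],
    "population_total": [100, 101, 102, 103],
})

# True for the years in which phones outnumber people.
more_phones = toy["cell_phones_total"] > toy["population_total"]

# Smallest matching year; this would be NaN if no year matched.
first_year = toy.loc[more_phones, "time"].min()
print(first_year)
```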

In [87]:
#Work on an explicit copy so adding a column does not trigger SettingWithCopyWarning
u = jointtable2[jointtable2['country']=='United States'].copy()
u['diff'] = u['cell_phones_total'] - u['population_total']
print(u[u['diff']>0].head(1))

#It is 2014

      cell_phones_total  population_total  time  geo        country  \
8131        355500000.0         317718779  2014  Usa  United States   

            diff  
8131  37781221.0  


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007 to 2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [122]:
#Let's get each country's data
#Starting with China
v1=jointtable2[jointtable2['time']>=2007] 
vc1=v1[v1['country']=='China']
vc1.reset_index(inplace=True)
vc1.head()

Unnamed: 0,index,cell_phones_total,population_total,time,geo,country,cell_phones_per_person
0,1486,547306000.0,1336800506,2007,Chn,China,0.409415
1,1487,641245000.0,1344415227,2008,Chn,China,0.476969
2,1488,747214000.0,1352068091,2009,Chn,China,0.552645
3,1489,859003000.0,1359755102,2010,Chn,China,0.631734
4,1490,986253000.0,1367480264,2011,Chn,China,0.721219


In [127]:
#India
v2=jointtable2[jointtable2['time']>=2007] 
vc2=v2[v2['country']=='India']
vc2.reset_index(inplace=True)
vc2.head()

Unnamed: 0,index,cell_phones_total,population_total,time,geo,country,cell_phones_per_person
0,3585,233620000.0,1179681239,2007,Ind,India,0.198037
1,3586,346890000.0,1197146906,2008,Ind,India,0.289764
2,3587,525090000.0,1214270132,2009,Ind,India,0.432433
3,3588,752190000.0,1230980691,2010,Ind,India,0.611049
4,3589,893862478.0,1247236029,2011,Ind,India,0.716675


In [130]:
#United States
v3=jointtable2[jointtable2['time']>=2007] 
vc3=v3[v3['country']=='United States']
vc3.reset_index(inplace=True)
vc3.head()

Unnamed: 0,index,cell_phones_total,population_total,time,geo,country,cell_phones_per_person
0,8124,249300000.0,300595175,2007,Usa,United States,0.829355
1,8125,261300000.0,303374067,2008,Usa,United States,0.861313
2,8126,274283000.0,306076362,2009,Usa,United States,0.896126
3,8127,285118000.0,308641391,2010,Usa,United States,0.923784
4,8128,297404000.0,311051373,2011,Usa,United States,0.956125


In [131]:
#Indonesia
v4=jointtable2[jointtable2['time']>=2007] 
vc4=v4[v4['country']=='Indonesia']
vc4.reset_index(inplace=True)
vc4.head()

Unnamed: 0,index,cell_phones_total,population_total,time,geo,country,cell_phones_per_person
0,3539,93386881.0,232989141,2007,Idn,Indonesia,0.400821
1,3540,140578243.0,236159276,2008,Idn,Indonesia,0.595269
2,3541,163676961.0,239340478,2009,Idn,Indonesia,0.683867
3,3542,211290235.0,242524123,2010,Idn,Indonesia,0.871213
4,3543,249805619.0,245707511,2011,Idn,Indonesia,1.016679


In [132]:
#Brazil
v5=jointtable2[jointtable2['time']>=2007] 
vc5=v5[v5['country']=='Brazil']
vc5.reset_index(inplace=True)
vc5.head()



Unnamed: 0,index,cell_phones_total,population_total,time,geo,country,cell_phones_per_person
0,1074,120980103.0,191026637,2007,Bra,Brazil,0.633315
1,1075,150641403.0,192979029,2008,Bra,Brazil,0.78061
2,1076,169385584.0,194895996,2009,Bra,Brazil,0.869108
3,1077,196929978.0,196796269,2010,Bra,Brazil,1.000679
4,1078,234357507.0,198686688,2011,Bra,Brazil,1.179533


In [0]:
#Let's pack all this info into one big dataframe
frames=[vc1, vc2, vc3, vc4, vc5]
final = pd.concat(frames)
final.reset_index(inplace=True)

In [124]:
final.head()

Unnamed: 0,level_0,cell_phones_per_person,cell_phones_total,country,geo,index,population_total,time
0,0,0.409415,547306000.0,China,Chn,1486.0,1336800506,2007
1,1,0.476969,641245000.0,China,Chn,1487.0,1344415227,2008
2,2,0.552645,747214000.0,China,Chn,1488.0,1352068091,2009
3,3,0.631734,859003000.0,China,Chn,1489.0,1359755102,2010
4,4,0.721219,986253000.0,China,Chn,1490.0,1367480264,2011


In [125]:
#Let's create the pivot table
table = pd.pivot_table(final, values='cell_phones_total', index=['country'],columns=['time'])
table

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323700.0,271099800.0,280728800.0,257814300.0,244067400.0,236488500.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984000.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720900.0,886304200.0,944008700.0,1001056000.0,1127809000.0,1168902000.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963700.0,313226900.0,325582800.0,338948300.0,385573400.0,458923200.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0


In [115]:
#Check the shape
table.shape

(5, 11)
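
The five per-country filters above can also be collapsed into a single filter with `isin` and `between` before pivoting. The sketch below shows the idea on a hypothetical long-format toy frame (`long_df`, `keep`, and all values are made up).

```python
import pandas as pd

# Hypothetical long-format data: one row per (country, year).
long_df = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B",
                "Country B", "Country C", "Country C"],
    "time": [2007, 2008, 2007, 2008, 2007, 2008],
    "cell_phones_total": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

keep = ["Country A", "Country B"]

# One combined filter: selected countries and an inclusive year range.
subset = long_df[long_df["country"].isin(keep)
                 & long_df["time"].between(2007, 2008)]

# Countries become rows, years become columns.
wide = subset.pivot_table(values="cell_phones_total",
                          index="country", columns="time")
print(wide.shape)
```

`pivot_table` aggregates with the mean by default; with one value per country and year, as here, the values pass through unchanged.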

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [153]:
bonus = pd.DataFrame({'country': ['China', 'India', 'United States', 'Indonesia', 'Brazil'], 'increase': [0, 0, 0, 0, 0]})
#China: last year (2017) minus first year (2007)
bonus.loc[0, 'increase'] = int(vc1['cell_phones_total'].iloc[-1] - vc1['cell_phones_total'].iloc[0])
print(bonus.loc[0, 'increase'])

926791000


In [154]:
#India
bonus.loc[1, 'increase'] = int(vc2['cell_phones_total'].iloc[-1] - vc2['cell_phones_total'].iloc[0])
print(bonus.loc[1, 'increase'])
#That country is India

935282277


In [155]:
#United States
bonus.loc[2, 'increase'] = int(vc3['cell_phones_total'].iloc[-1] - vc3['cell_phones_total'].iloc[0])
print(bonus.loc[2, 'increase'])

146581000


In [156]:
#Indonesia
bonus.loc[3, 'increase'] = int(vc4['cell_phones_total'].iloc[-1] - vc4['cell_phones_total'].iloc[0])
print(bonus.loc[3, 'increase'])

365536321


In [157]:
#Brazil
bonus.loc[4, 'increase'] = int(vc5['cell_phones_total'].iloc[-1] - vc5['cell_phones_total'].iloc[0])
print(bonus.loc[4, 'increase'])

115508445


In [158]:
bonus.sort_values("increase", inplace = True, ascending=False) 
bonus.head()

Unnamed: 0,country,increase
1,India,935282277
0,China,926791000
3,Indonesia,365536321
2,United States,146581000
4,Brazil,115508445
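
Once the data is in wide form, with years as columns, the same increase can be computed for every country in one step by subtracting two columns. This is a sketch on a hypothetical toy pivot table; the country names and values are invented.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame(
    {2007: [10.0, 40.0, 25.0], 2017: [50.0, 45.0, 90.0]},
    index=pd.Index(["Country A", "Country B", "Country C"], name="country"),
)

# Column-wise subtraction gives the increase per country,
# then sort from the biggest increase to the smallest.
increase = (wide[2017] - wide[2007]).sort_values(ascending=False)
print(increase)
```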


If you have the time and curiosity, what other questions can you ask and answer with this data?