# Data Cleaning and Analysis
## Combining Data with Pandas

| Feature | pd.concat() | pd.merge() |
|---|---|---|
| Default Join Type | Outer | Inner |
| Can Combine More Than Two Dataframes at a Time? | Yes | No |
| Can Combine Dataframes Vertically (axis=0) or Horizontally (axis=1)? | Both | Horizontally |
| Syntax | Concat (Vertically)<br>`pd.concat([df1, df2, df3])`<br><br>Concat (Horizontally)<br>`pd.concat([df1, df2, df3], axis=1)` | Merge (Join on Columns)<br>`pd.merge(left=df1, right=df2, how='join_type', on='Col')`<br><br>Merge (Join on Index)<br>`pd.merge(left=df1, right=df2, how='join_type', left_index=True, right_index=True)` |

In the last exercise, we confirmed that the mean world happiness score stayed approximately the same from 2015 to 2017.

In this mission, we learned how to combine data using the pd.concat() and pd.merge() functions. You may also come across the df.append() and df.join() methods, which are essentially shortcuts for concat() and merge(). Note that df.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so code written for current versions should use pd.concat() instead. We didn't cover these methods in this mission because the differences are few; the pandas documentation describes them in more detail.
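
To make the "shortcut" relationship concrete, here is a minimal sketch comparing df.join() with the equivalent pd.merge() call. The two small dataframes and their values are made up for illustration and do not come from the World Happiness files.

```python
import pandas as pd

# Toy data (made-up values, not from the World Happiness files)
left = pd.DataFrame({"score": [7.5, 7.4]}, index=["A", "B"])
right = pd.DataFrame({"rank": [1, 3]}, index=["A", "C"])

# df.join() joins on the index and defaults to a left join
joined = left.join(right)

# The equivalent pd.merge() call spells those choices out explicitly
merged = pd.merge(left, right, how="left", left_index=True, right_index=True)

print(joined.equals(merged))  # True
```

The main difference is the defaults: df.join() assumes an index-based left join, while pd.merge() assumes a column-based inner join.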

As we saw in the last screen, in order to perform more complex analysis, we have to be able to clean and manipulate data, whether it be adding data to a dataframe or renaming a column. In the next mission, we'll continue building on what we've learned so far as we learn ways to transform and reshape our data.

### Import Statements

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Stylistics

In [2]:
%matplotlib inline
plt.style.use("dark_background")
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

### Importing Dataset

In [3]:
happiness2015 = pd.read_csv("../datasets/World_Happiness_2015.csv")

### Introduction
We've already read the World_Happiness_2015.csv file into a dataframe called happiness2015.

- Use the pandas.read_csv() function to read the World_Happiness_2016.csv file into a dataframe called happiness2016 and the World_Happiness_2017.csv file into a dataframe called happiness2017.
- Add a column called Year to each dataframe with the corresponding year. For example, the Year column in happiness2015 should contain the value 2015 for each row.

In [4]:
# Assumes the 2016 and 2017 files sit next to the 2015 file in ../datasets/
happiness2016 = pd.read_csv("../datasets/World_Happiness_2016.csv")
happiness2017 = pd.read_csv("../datasets/World_Happiness_2017.csv")

# Tag every row of each dataframe with the year it describes
for year, dataset in zip([2015, 2016, 2017], [happiness2015, happiness2016, happiness2017]):
    dataset["Year"] = year

### Combining Dataframes with the Concat Function
We've already saved the subsets from happiness2015 and happiness2016 to the variables head_2015 and head_2016.

- Use the pd.concat() function to combine head_2015 and head_2016 along axis = 0. Remember to pass head_2015 and head_2016 into the function as a list. Assign the result to concat_axis0.
- Use the pd.concat() function to combine head_2015 and head_2016 along axis = 1. Remember to pass head_2015 and head_2016 into the function as a list and set the axis parameter equal to 1. Assign the result to concat_axis1.
- Use the variable inspector to view concat_axis0 and concat_axis1.
    - Assign the number of rows in concat_axis0 to a variable called question1.
    - Assign the number of rows in concat_axis1 to a variable called question2.

In [None]:
head_2015 = happiness2015[['Country','Happiness Score', 'Year']].head(3)
head_2016 = happiness2016[['Country','Happiness Score', 'Year']].head(3)

concat_axis0 = pd.concat([head_2015, head_2016])
concat_axis1 = pd.concat([head_2015, head_2016], axis=1)

question1 = concat_axis0.shape[0]
question2 = concat_axis1.shape[0]
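
The effect of the axis parameter is easiest to see on data small enough to check by hand. The sketch below uses two toy dataframes as stand-ins for head_2015 and head_2016; the names head_a and head_b and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for head_2015 and head_2016 (values are made up)
head_a = pd.DataFrame({"Country": ["A", "B", "C"], "Score": [7.5, 7.4, 7.3], "Year": 2015})
head_b = pd.DataFrame({"Country": ["D", "A", "B"], "Score": [7.5, 7.4, 7.3], "Year": 2016})

# axis=0 (the default) stacks the rows of the second dataframe under the first
stacked = pd.concat([head_a, head_b])

# axis=1 places the columns of the second dataframe next to the first
side_by_side = pd.concat([head_a, head_b], axis=1)

print(stacked.shape)       # (6, 3)
print(side_by_side.shape)  # (3, 6)
```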

### Combining Dataframes with the Concat Function Continued
We've already created the head_2015 and head_2016 variables.

- Use the pd.concat() function to combine head_2015 and head_2016 along axis = 0. Remember to pass head_2015 and head_2016 into the function as a list. Assign the result to concat_axis0.
- Use the variable inspector to view concat_axis0.
    - Assign the number of rows in concat_axis0 to a variable called rows.
    - Assign the number of columns in concat_axis0 to a variable called columns.

In [None]:
head_2015 = happiness2015[['Year','Country','Happiness Score', 'Standard Error']].head(4)
head_2016 = happiness2016[['Country','Happiness Score', 'Year']].head(3)

concat_axis0 = pd.concat([head_2015, head_2016])

rows = concat_axis0.shape[0]
columns = concat_axis0.shape[1]
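
When the dataframes have different columns, pd.concat() keeps every column by default (an outer join) and fills the gaps with NaN. The following sketch shows that behavior next to join="inner", which keeps only the shared columns. The dataframes are toy stand-ins with made-up values.

```python
import pandas as pd

# Toy stand-ins with different shapes (values are made up)
head_a = pd.DataFrame({
    "Year": 2015,
    "Country": ["A", "B", "C", "D"],
    "Score": [7.5, 7.4, 7.3, 7.2],
    "Standard Error": [0.03, 0.04, 0.03, 0.04],
})
head_b = pd.DataFrame({"Country": ["A", "B", "C"], "Score": [7.5, 7.4, 7.3], "Year": 2016})

# Default join="outer": all four columns survive, missing cells become NaN
outer = pd.concat([head_a, head_b])

# join="inner": only the three columns present in both dataframes survive
inner = pd.concat([head_a, head_b], join="inner")

print(outer.shape)                           # (7, 4)
print(outer["Standard Error"].isna().sum())  # 3
print(inner.shape)                           # (7, 3)
```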

### Combining Dataframes with Different Shapes Using the Concat Function
- Use the pd.concat() function to combine head_2015 and head_2016 along axis = 0 again. This time, however, set the ignore_index parameter to True to reset the index in the result. Assign the result to concat_update_index.
    - Use the variable inspector to view the results and confirm the index was reset.

In [None]:
head_2015 = happiness2015[['Year','Country','Happiness Score', 'Standard Error']].head(4)
head_2016 = happiness2016[['Country','Happiness Score', 'Year']].head(3)

concat_update_index = pd.concat([head_2015, head_2016], ignore_index=True)
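
To see why resetting the index matters, compare the index labels with and without ignore_index. This is a small sketch on made-up data; the names a, b, kept, and reset are only for illustration.

```python
import pandas as pd

# Two toy dataframes, each with the default index 0, 1
a = pd.DataFrame({"Country": ["A", "B"]})
b = pd.DataFrame({"Country": ["C", "D"]})

kept = pd.concat([a, b])                      # original index labels are kept
reset = pd.concat([a, b], ignore_index=True)  # a fresh 0..n-1 index is assigned

print(list(kept.index))   # [0, 1, 0, 1]
print(list(reset.index))  # [0, 1, 2, 3]
```

With the duplicated labels in kept, a lookup such as kept.loc[0] returns two rows rather than one, which is usually not what we want.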

### Joining Dataframes with the Merge Function
We've already saved three rows from happiness2015 and happiness2016 to variables named three_2015 and three_2016.

- Use the pd.merge() function to join three_2015 and three_2016 on the Country column. Assign the result to merged.

In [None]:
three_2015 = happiness2015[['Country','Happiness Rank','Year']].iloc[2:5]
three_2016 = happiness2016[['Country','Happiness Rank','Year']].iloc[2:5]

merged = pd.merge(left=three_2015, right=three_2016, on="Country")
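
The default inner join keeps only the keys found in both dataframes. Here is a minimal sketch using toy stand-ins for three_2015 and three_2016; the country labels and ranks are made up so that the two dataframes overlap on exactly two keys.

```python
import pandas as pd

# Toy stand-ins for three_2015 and three_2016 (values are made up)
left = pd.DataFrame({"Country": ["A", "B", "C"], "Happiness Rank": [3, 4, 5], "Year": 2015})
right = pd.DataFrame({"Country": ["B", "C", "D"], "Happiness Rank": [3, 4, 5], "Year": 2016})

# how="inner" is the default, so only countries present in both are kept
merged = pd.merge(left=left, right=right, on="Country")

print(merged["Country"].tolist())  # ['B', 'C']
print(merged.columns.tolist())
# ['Country', 'Happiness Rank_x', 'Year_x', 'Happiness Rank_y', 'Year_y']
```

Columns that share a name but are not join keys receive the default suffixes _x (left) and _y (right).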

### Joining on Columns with the Merge Function

There are four different types of joins:

1. Inner: only includes rows whose key appears in both dataframes
2. Outer: includes all rows from both dataframes
3. Left: includes all of the rows from the "left" dataframe along with any rows from the "right" dataframe with a common key; the result retains all columns from both of the original dataframes
4. Right: includes all of the rows from the "right" dataframe along with any rows from the "left" dataframe with a common key; the result retains all columns from both of the original dataframes

- Update merged to use a left join instead of an inner join. Set the how parameter to 'left' in merge(). Assign the result to merged_left.
- Update merged_left so that the left parameter equals three_2016 and the right parameter equals three_2015. Assign the result to merged_left_updated.
- Based on the results of this exercise, when using a left join, does changing the dataframe assigned to the left and right parameters change the result? Try to answer this question before moving on to the next screen.

In [None]:
three_2015 = happiness2015[['Country','Happiness Rank','Year']].iloc[2:5]
three_2016 = happiness2016[['Country','Happiness Rank','Year']].iloc[2:5]
merged = pd.merge(left=three_2015, right=three_2016, on='Country')

merged_left = pd.merge(left=three_2015, right=three_2016, on='Country', how="left")
merged_left_updated = pd.merge(left=three_2016, right=three_2015, on='Country', how="left")
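
The four join types, and the effect of swapping left and right, can be checked on a small example. The sketch below reuses toy stand-ins for three_2015 and three_2016 with made-up values; the names row_counts, left_first, and right_first are only for illustration.

```python
import pandas as pd

# Toy stand-ins for three_2015 and three_2016 (values are made up)
left = pd.DataFrame({"Country": ["A", "B", "C"], "Happiness Rank": [3, 4, 5], "Year": 2015})
right = pd.DataFrame({"Country": ["B", "C", "D"], "Happiness Rank": [3, 4, 5], "Year": 2016})

# Number of rows produced by each join type on the shared key
row_counts = {
    how: len(pd.merge(left, right, on="Country", how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)  # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}

# Swapping left and right changes which rows a left join keeps
left_first = pd.merge(left, right, on="Country", how="left")
right_first = pd.merge(right, left, on="Country", how="left")
print(left_first["Country"].tolist())   # ['A', 'B', 'C']
print(right_first["Country"].tolist())  # ['B', 'C', 'D']
```

In this toy case the answer to the question above is yes: a left join always keeps every row of whichever dataframe is passed as left.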

### Left Joins with the Merge Function
- Update merged to use the suffixes _2015 and _2016. Set the suffixes parameter to ('_2015', '_2016') in merge(). Assign the result to merged_suffixes.
- Update merged_updated to use the suffixes _2015 and _2016. Because the "left" dataframe is three_2016 and the "right" dataframe is three_2015, pass the suffixes in the order ('_2016', '_2015'). Assign the result to merged_updated_suffixes.

In [None]:
three_2015 = happiness2015[['Country','Happiness Rank','Year']].iloc[2:5]
three_2016 = happiness2016[['Country','Happiness Rank','Year']].iloc[2:5]
merged = pd.merge(left=three_2015, right=three_2016, how='left', on='Country')
merged_updated = pd.merge(left=three_2016, right=three_2015, how = 'left', on='Country')

merged_suffixes = pd.merge(left=three_2015, right=three_2016, how='left', on='Country', suffixes=("_2015", "_2016"))
merged_updated_suffixes = pd.merge(left=three_2016, right=three_2015, how = 'left', on='Country', suffixes=("_2016", "_2015"))
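
The suffixes tuple is applied positionally: the first suffix goes to overlapping columns from the left dataframe and the second to those from the right. A short sketch on toy data (made-up values) shows the resulting column names for both orderings.

```python
import pandas as pd

# Toy stand-ins for three_2015 and three_2016 (values are made up)
df_2015 = pd.DataFrame({"Country": ["A", "B", "C"], "Happiness Rank": [3, 4, 5], "Year": 2015})
df_2016 = pd.DataFrame({"Country": ["B", "C", "D"], "Happiness Rank": [3, 4, 5], "Year": 2016})

# 2015 on the left: the first suffix labels the 2015 columns
a = pd.merge(df_2015, df_2016, how="left", on="Country", suffixes=("_2015", "_2016"))

# 2016 on the left: the suffix order must be swapped to stay accurate
b = pd.merge(df_2016, df_2015, how="left", on="Country", suffixes=("_2016", "_2015"))

print(a.columns.tolist())
# ['Country', 'Happiness Rank_2015', 'Year_2015', 'Happiness Rank_2016', 'Year_2016']
print(b.columns.tolist())
# ['Country', 'Happiness Rank_2016', 'Year_2016', 'Happiness Rank_2015', 'Year_2015']
```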

### Join on Index with the Merge Function
We've already saved four_2015 and three_2016. In this exercise, we'll use a left join to combine four_2015 and three_2016.

- Predict the number of rows and columns the resulting dataframe will have. Assign the number of rows to a variable called rows and the number of columns to a variable called columns.
- To change the join type used in merge_index to a left join, set the how parameter equal to 'left'. Save the result to merge_index_left.
- Update rows and columns so that each contains the correct number of rows and columns in merge_index_left.

In [None]:
four_2015 = happiness2015[['Country','Happiness Rank','Year']].iloc[2:6]
three_2016 = happiness2016[['Country','Happiness Rank','Year']].iloc[2:5]
merge_index = pd.merge(left = four_2015,right = three_2016, left_index = True, right_index = True, suffixes = ('_2015','_2016'))

merge_index_left = pd.merge(left = four_2015,right = three_2016, left_index = True, right_index = True, suffixes = ('_2015','_2016'), how="left")
rows = merge_index_left.shape[0]
columns = merge_index_left.shape[1]
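
When joining on the index, no column acts as the key, so every overlapping column (including Country) receives a suffix. The sketch below mimics the shapes of four_2015 and three_2016 with made-up values and explicit index labels 2 to 5 and 2 to 4.

```python
import pandas as pd

# Toy stand-ins for four_2015 and three_2016 (values are made up)
four = pd.DataFrame(
    {"Country": ["A", "B", "C", "D"], "Happiness Rank": [3, 4, 5, 6], "Year": 2015},
    index=[2, 3, 4, 5],
)
three = pd.DataFrame(
    {"Country": ["B", "A", "C"], "Happiness Rank": [3, 4, 5], "Year": 2016},
    index=[2, 3, 4],
)

# Inner join on the index: only labels 2, 3, 4 appear in both
inner = pd.merge(four, three, left_index=True, right_index=True,
                 suffixes=("_2015", "_2016"))

# Left join on the index: label 5 is kept, with NaN in the 2016 columns
left_join = pd.merge(four, three, left_index=True, right_index=True,
                     suffixes=("_2015", "_2016"), how="left")

print(inner.shape)      # (3, 6)
print(left_join.shape)  # (4, 6)
```

Note that rows are matched by index label only, so a row can pair two different countries, as labels 2 and 3 do here.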

### Challenge: Combine Data and Create a Visualization
We've already created a Year column in happiness2017 and renamed the Happiness.Score column to Happiness Score.

- Use either the pd.concat() function or the pd.merge() function to combine happiness2015, happiness2016, and happiness2017. Assign the result to combined.
    - Think about whether you need to combine the data horizontally or vertically in order to create a dataframe that can be grouped by year, and decide which function (pd.concat() or pd.merge()) you can use to combine the data.
- Use the df.pivot_table() method to create a pivot table from the combined dataframe. Set Year as the index and Happiness Score as the values. Assign the result to pivot_table_combined.
- Use the df.plot() method to create a bar chart of the results. Set the kind parameter to barh, the title to 'Mean Happiness Scores by Year', and the xlim parameter to (0,10).
- Try to answer the following question based on the results of this exercise: Did world happiness increase, decrease, or stay about the same from 2015 to 2017?

In [None]:
happiness2017.rename(columns={'Happiness.Score': 'Happiness Score'}, inplace=True)

combined = pd.concat([happiness2015, happiness2016, happiness2017])
pivot_table_combined = combined.pivot_table(index="Year", values="Happiness Score")
_ = pivot_table_combined.plot(kind="barh", title="Mean Happiness Scores by Year", xlim=(0,10))
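
The same concat-then-pivot pipeline can be verified on a tiny example where the means are easy to compute by hand. All values below are made up and the names y2015, y2016, y2017, and means are only for illustration; the plotting step is left out so the sketch runs without matplotlib.

```python
import pandas as pd

# Toy yearly dataframes (made-up scores, not the real World Happiness data)
y2015 = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.0, 5.0], "Year": 2015})
y2016 = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.5, 5.0], "Year": 2016})
y2017 = pd.DataFrame({"Country": ["A", "B"], "Happiness.Score": [7.0, 5.5], "Year": 2017})

# Align the 2017 column name with the other years before combining
y2017 = y2017.rename(columns={"Happiness.Score": "Happiness Score"})

# Stack vertically so each row carries its own Year value
combined = pd.concat([y2015, y2016, y2017], ignore_index=True)

# pivot_table aggregates with the mean by default
means = combined.pivot_table(index="Year", values="Happiness Score")
print(means)
```

Combining vertically is what makes grouping by Year possible; a horizontal merge would instead produce one set of score columns per year.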