# `pandas` Part 7: Combining Datasets with `concat()`

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Combine DataFrames and/or Series with `concat()`
2. Understand a multi-index
3. Reset an index with `reset_index()`
4. Perform descriptive analytics on a combined DataFrame

## Files Needed for this lesson:
>- `CAvideos.csv`
>- `GBvideos.csv`
>- Download these CSV files from Canvas prior to the lesson
>- C:\\Users\\mimc2537\\OneDrive - UCB-O365\\python\\pandas\\

## The general steps to working with pandas:
1. `import pandas as pd`
2. Create or load data into a pandas DataFrame or Series
3. Read data with one of the `pd.read_` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Csv files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5-10 records in your DataFrame

Narration videos:
    
- Part 1: https://youtu.be/UAeRt1OD1Tc
- Part 2: https://youtu.be/10Rod_hxRZs

# Introduction Notes on Combining Data Using `pandas`
1. Being able to combine data from multiple sources is a critical skill for analytics professionals
2. We will learn the `pandas` way of combining data but there are similarities here to SQL
3. Why combine data with `pandas` if you can do the same thing in SQL?
>- The answer to this depends on the project
>- Some projects can be completed more efficiently entirely in `pandas`, so you wouldn't necessarily need SQL
>- For some projects, incorporating SQL into our Python code makes sense
>- In an analytics job, you will likely use both Python and SQL to get the job done!

# Initial set-up steps
1. Import modules and check working directory
2. Read data in
3. Check the data
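
A minimal sketch of step 1. The original notebook cell is not shown here, so using the standard-library `os` module to check the working directory is an assumption:

```python
import os
import pandas as pd

# Show the folder the notebook is currently running from
cwd = os.getcwd()
print(cwd)

# List any CSV files already sitting in that folder
csv_files = [f for f in os.listdir(cwd) if f.endswith(".csv")]
print(csv_files)
```

If `CAvideos.csv` and `GBvideos.csv` do not appear in the list, either move them into this folder or pass the full path to the read function.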

# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file names: 
>>- `CAvideos.csv`
>>- `GBvideos.csv`
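
A sketch of the read step. So that it runs without the course files, it reads two tiny made-up CSV strings through `io.StringIO`; the rows and the reduced column set are illustrative assumptions, not the real data. The DataFrame names `canadian_youtube` and `uk_youtube` are the ones used later in this lesson.

```python
import io
import pandas as pd

# Made-up stand-ins for the contents of the two files (illustration only)
ca_csv = io.StringIO(
    "title,channel_title,views,likes\n"
    "Video A,Channel 1,100,10\n"
    "Video B,Channel 2,200,20\n"
)
uk_csv = io.StringIO(
    "title,channel_title,views,likes\n"
    "Video C,Channel 3,300,30\n"
)

# With the real files in the notebook's folder this would be:
#   canadian_youtube = pd.read_csv('CAvideos.csv')
#   uk_youtube = pd.read_csv('GBvideos.csv')
canadian_youtube = pd.read_csv(ca_csv)
uk_youtube = pd.read_csv(uk_csv)
```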

### Check how many rows and columns are in our DataFrames
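
A short sketch of the row and column check, using small made-up DataFrames in place of the real files:

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"title": ["A", "B", "C"], "views": [100, 200, 300]})
uk_youtube = pd.DataFrame({"title": ["D", "E"], "views": [400, 500]})

# shape is an attribute (no parentheses) that returns (rows, columns)
print(canadian_youtube.shape)
print(uk_youtube.shape)
```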

### Check a couple of rows of data in one of the new DataFrames
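
A sketch of previewing rows with `head()`, again on made-up data:

```python
import pandas as pd

# Made-up stand-in with eight rows
uk_youtube = pd.DataFrame({
    "title": [f"Video {i}" for i in range(8)],
    "views": list(range(0, 800, 100)),
})

first_five = uk_youtube.head()   # default: first 5 rows
first_two = uk_youtube.head(2)   # pass a number to change how many rows are shown
print(first_two)
```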

## Check the datatypes
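
A sketch of the datatype check. The `dtypes` attribute lists one type per column; the data here is made up for illustration:

```python
import pandas as pd

# Made-up stand-in for one of the YouTube DataFrames
canadian_youtube = pd.DataFrame({
    "title": ["A", "B"],
    "views": [100, 200],
    "likes": [10, 20],
})

# One entry per column: text columns and numeric columns get different types
print(canadian_youtube.dtypes)
```

Checking the datatypes before combining helps confirm that the same column holds the same kind of data in both DataFrames.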

# Combining DataFrames
>- The three common ways to combine datasets in pandas are `concat()`, `join()`, and `merge()`
>- `concat()` will take two DataFrames or Series and append them together
>>- This is basically taking DataFrames and stacking their data on top of each other into one DataFrame
>>- For `concat()` you generally want the columns/fields in both DataFrames to be the same; by default, any column that is missing from one DataFrame is filled with `NaN`
>- `join()` "links" DataFrames together based on a common field/column between the two
>- `merge()` also links DataFrames together based on common field/columns but with different syntax.
>>- We will cover the most basic join in this class
>>- A more in depth study of joins is provided in SQL focused courses
>>- Pandas join reference for further study: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
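
To make the difference between the three concrete, here is a small sketch on made-up data. The `video_id` key column and all values are hypothetical and chosen only for illustration.

```python
import pandas as pd

left = pd.DataFrame({"video_id": ["a", "b"], "views": [100, 200]})
right = pd.DataFrame({"video_id": ["a", "b"], "likes": [10, 20]})
more_rows = pd.DataFrame({"video_id": ["c"], "views": [300]})

# concat(): stack rows that share the same columns
stacked = pd.concat([left, more_rows])

# merge(): link rows through a common column
merged = left.merge(right, on="video_id")

# join(): link rows through the index
joined = left.set_index("video_id").join(right.set_index("video_id"))

print(stacked.shape, merged.shape, joined.shape)
```

`concat()` adds rows, while `merge()` and `join()` add columns by matching rows on a shared key.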


# Using the YouTube DataFrames to practice combining data with pandas
>- The YouTube datasets store data on various YouTube trending statistics
>- Our example datasets show several months of data and daily trending YouTube videos.

>- For more information and for other YouTube datasets see the following link:
>>- https://www.kaggle.com/datasnaek/youtube-new

### First, creating a new DataFrame that appends the Canadian and British YouTube DataFrames
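
A sketch of the `concat()` call that the notes below describe, run on tiny made-up stand-ins for the two DataFrames. The keys `'can'` and `'uk'` come from the notes; the index labels `'country'` and `'rowid'` are assumed names chosen for illustration.

```python
import pandas as pd

# Tiny made-up stand-ins for the DataFrames read from CAvideos.csv and GBvideos.csv
canadian_youtube = pd.DataFrame({"title": ["Ellen clip 1", "Other clip"],
                                 "views": [100, 50]})
uk_youtube = pd.DataFrame({"title": ["Ellen clip 1", "Ellen clip 2"],
                           "views": [300, 200]})

CanUK = pd.concat([canadian_youtube, uk_youtube],  # line 1: the DataFrames to stack
                  keys=['can', 'uk'],              # line 2: outer index level, one key per DataFrame
                  names=['country', 'rowid'])      # line 3: labels for the two index levels
print(CanUK)
```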

#### Some notes on the previous code
>- Line 1: We define a new DataFrame named `CanUK` as the concatenation, via `concat()`, of two datasets
>>- Dataset 1 = canadian_youtube
>>- Dataset 2 = uk_youtube
>>- The `concat()` function takes the two (or more if applicable) DataFrames and "stacks" them on top of each other
>- Line 2: We use the `keys` option to define a multi-index (aka hierarchical index)
>>- Because our datasets represent YouTube videos from different countries we pass the abbreviated names of those countries as a list to `keys`
>>- Enter the key names in the same order the DataFrames appear in line 1 (e.g., 'can' first, 'uk' second)
>- Line 3: We use the `names` option to label our index columns from line 2
>>- Without the `names` option we would not have anything above our index columns

### Check the index for any dataframe using `DataFrame.index`
>- Note how `concat()` keeps each country's original row ids rather than continuing the count across the combined DataFrame
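
A sketch of inspecting the index on made-up data. The index labels `'country'` and `'rowid'` are assumed names for illustration.

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"views": [100, 50]})
uk_youtube = pd.DataFrame({"views": [300, 200]})

CanUK = pd.concat([canadian_youtube, uk_youtube],
                  keys=['can', 'uk'],
                  names=['country', 'rowid'])

print(canadian_youtube.index)  # a simple index counting 0, 1
print(CanUK.index)             # a MultiIndex of (country, rowid) pairs

index_pairs = CanUK.index.tolist()
```

The row id restarts at 0 for the `'uk'` rows, which is why the country key is needed to tell the rows apart.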

### Take a look at our new DataFrame

#### Did using `concat()` work to append the two DataFrames together? 
>- Check the shape of your new DataFrame
>- Compare the number of records to each one individually
>>- canadian_youtube = 40881 records
>>- uk_youtube = 38916 records
>>- 40881 + 38916 = 79797 total records
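
The same check can be written in code. This sketch uses made-up DataFrames, so the row counts differ from the real files:

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"views": [100, 50, 75]})
uk_youtube = pd.DataFrame({"views": [300, 200]})

CanUK = pd.concat([canadian_youtube, uk_youtube], keys=['can', 'uk'])

# The combined row count should equal the sum of the individual row counts
expected_rows = canadian_youtube.shape[0] + uk_youtube.shape[0]
print(CanUK.shape[0], expected_rows)
rows_match = CanUK.shape[0] == expected_rows
```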

#### `reset_index`:
##### Note: You can reset an index with `reset_index()`
>- This can be useful in some situations, such as when you want an index level back as a regular column
>- For a multi-index you can pass the `level` option to specify which index level you want to reset
>- Note: `reset_index()` returns a new DataFrame; to change our current DataFrame we would need to pass `inplace=True` or assign the result back to the same name
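
A sketch of both forms of `reset_index()` on made-up data. The index labels `'country'` and `'rowid'` are assumed names for illustration.

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"views": [100, 50]})
uk_youtube = pd.DataFrame({"views": [300, 200]})

CanUK = pd.concat([canadian_youtube, uk_youtube],
                  keys=['can', 'uk'],
                  names=['country', 'rowid'])

# Reset every level: both index levels become ordinary columns
flat = CanUK.reset_index()

# Reset only one level: 'country' becomes a column, 'rowid' stays as the index
by_rowid = CanUK.reset_index(level='country')

print(flat)
print(by_rowid)
```

Neither call changes `CanUK` itself, because the result is stored under a new name.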

# Now some descriptive analytics

### What channels have the most trending videos?
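
One way to answer this is `value_counts()` on the `channel_title` column. This is a sketch on made-up data, and the original notebook may have used a different approach.

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"channel_title": ["TheEllenShow", "Channel X"]})
uk_youtube = pd.DataFrame({"channel_title": ["TheEllenShow", "TheEllenShow"]})

CanUK = pd.concat([canadian_youtube, uk_youtube], keys=['can', 'uk'])

# Count rows per channel, most frequent first, and keep the top 10
top_channels = CanUK['channel_title'].value_counts().head(10)
print(top_channels)
```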

### What are the quantitative descriptive statistics for TheEllenShow?
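
A sketch that filters with `where()` and then calls `describe()`, on made-up data. `where()` keeps every row but replaces the non-matching ones with `NaN`, and `describe()` ignores those missing values.

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"channel_title": ["TheEllenShow", "Channel X"],
                                 "views": [100, 50],
                                 "likes": [10, 5]})
uk_youtube = pd.DataFrame({"channel_title": ["TheEllenShow", "TheEllenShow"],
                           "views": [300, 200],
                           "likes": [30, 20]})

CanUK = pd.concat([canadian_youtube, uk_youtube], keys=['can', 'uk'])

# Rows from other channels become NaN, so they drop out of the statistics
ellen_stats = CanUK.where(CanUK.channel_title == 'TheEllenShow').describe()
print(ellen_stats)
```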

##### Alternatively, you can use `loc[]` to perform the filtering operation
>- The choice between `where()` and `loc[]` depends on the question/purpose, or sometimes just personal preference
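
A sketch comparing the two filters on made-up data. `loc[]` drops the non-matching rows, while `where()` keeps them as `NaN`.

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"channel_title": ["TheEllenShow", "Channel X"],
                                 "views": [100, 50]})
uk_youtube = pd.DataFrame({"channel_title": ["TheEllenShow", "TheEllenShow"],
                           "views": [300, 200]})

CanUK = pd.concat([canadian_youtube, uk_youtube], keys=['can', 'uk'])
is_ellen = CanUK.channel_title == 'TheEllenShow'

with_loc = CanUK.loc[is_ellen]    # only the matching rows remain
with_where = CanUK.where(is_ellen)  # same number of rows, non-matches are NaN

print(with_loc.describe())
```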

### What were the total YouTube videos, total views, likes and dislikes for TheEllenShow?
>- Using the `agg()` function to calculate specific aggregations on different columns
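
A sketch of `agg()` with a dictionary that maps each column to its own aggregation, on made-up data. Counting the `title` column to get the number of videos is an assumption about how the original cell did it.

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"title": ["Ellen clip 1", "Other clip"],
                                 "channel_title": ["TheEllenShow", "Channel X"],
                                 "views": [100, 50],
                                 "likes": [10, 5],
                                 "dislikes": [1, 0]})
uk_youtube = pd.DataFrame({"title": ["Ellen clip 1", "Ellen clip 2"],
                           "channel_title": ["TheEllenShow", "TheEllenShow"],
                           "views": [300, 200],
                           "likes": [30, 20],
                           "dislikes": [3, 2]})

CanUK = pd.concat([canadian_youtube, uk_youtube], keys=['can', 'uk'])

ellen = CanUK.loc[CanUK.channel_title == 'TheEllenShow']

# Each key is a column name, each value is the aggregation to apply to it
totals = ellen.agg({'title': 'count',
                    'views': 'sum',
                    'likes': 'sum',
                    'dislikes': 'sum'})
print(totals)
```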

## What are the totals for TheEllenShow's top 5 most viewed videos?
>- Only include the title names as part of the output (not channel or any other categorical fields)
>- Include total views, likes, dislikes, and comment count in the output
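
A sketch of one way to build this result, following the `loc[]`, column list, `groupby()`, and `sum()` pieces described in the notes that follow. The data is made up, and sorting by `views` before `head(5)` is an assumption about how the top 5 were picked.

```python
import pandas as pd

# Tiny made-up stand-ins for the two YouTube DataFrames
canadian_youtube = pd.DataFrame({"title": ["Ellen clip 1", "Other clip"],
                                 "channel_title": ["TheEllenShow", "Channel X"],
                                 "views": [100, 50],
                                 "likes": [10, 5],
                                 "dislikes": [1, 0],
                                 "comment_count": [4, 2]})
uk_youtube = pd.DataFrame({"title": ["Ellen clip 1", "Ellen clip 2"],
                           "channel_title": ["TheEllenShow", "TheEllenShow"],
                           "views": [300, 200],
                           "likes": [30, 20],
                           "dislikes": [3, 2],
                           "comment_count": [6, 8]})

CanUK = pd.concat([canadian_youtube, uk_youtube], keys=['can', 'uk'])

top5 = (
    CanUK.loc[CanUK.channel_title == 'TheEllenShow',                     # filter rows
              ['title', 'views', 'likes', 'dislikes', 'comment_count']]  # pick columns
    .groupby(['title'])                      # one group per video title
    .sum()                                   # total each numeric column per title
    .sort_values('views', ascending=False)   # most viewed first
    .head(5)                                 # keep the top 5
)
print(top5)
```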

# Some Notes on the Previous Example
>- Our pandas code in the previous example is similar to SQL in the following ways:
    1. `loc[CanUK.channel_title == 'TheEllenShow',` is the pandas equivalent of SQL's `WHERE channel_title = 'TheEllenShow'`
    2. `['title','views','likes','dislikes','comment_count']` is the pandas equivalent of SQL's
        `SELECT title, sum(views), sum(likes), sum(dislikes), sum(comment_count)`
    3. `groupby(['title'])` is the pandas equivalent of SQL's `GROUP BY title`
    4. In pandas we enter the aggregation after the `groupby()`, in this example `sum()`
      >>- In SQL we write the aggregation in the `SELECT` statement
      
## In future lessons we will continue to learn how pandas and SQL relate