# `pandas` Part 1: this notebook is a first lesson on `pandas`
## The main objective of this tutorial is to introduce `pandas` and create some DataFrames
>- pandas is one of the most popular Python libraries for data analytics and data science projects
>- We will be working with pandas in most of the remaining lessons, through the final


# Learning Objectives
## By the end of this tutorial you will be able to:
1. Import the `pandas` module and give it an alias
2. Define a pandas DataFrame and Series
3. Create a pandas DataFrame from scratch
4. Create a pandas DataFrame by reading an Excel file
5. Create a pandas DataFrame by reading a csv file
6. Examine your DataFrames using the `shape` attribute and the `head()` method

## Files Needed for this lesson: `wine.csv`
>- Download this csv from your learning management system prior to the lesson

## The general steps to working with pandas:
1. `import pandas as pd`
>- The `as pd` part is optional, but `pd` is the conventional alias for pandas and makes the code a bit easier to write
2. Create or load data into a pandas DataFrame or Series
>- In practice, you will likely be loading more datasets than creating but we will learn both
3. Read data with the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- csv files: `pd.read_csv('fileName.csv')`
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many rows/records and columns are in your DataFrame
>- Use `head()` to show the first records in your DataFrame (5 by default; pass a number such as `head(10)` to see more)
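
The four steps above can be sketched end to end without any course file. This is a minimal illustration, not the lesson's own code: the column values are made up, and the file name `example.csv` and the variable names `df`, `path`, and `loaded` are hypothetical.

```python
import os
import tempfile

import pandas as pd  # Step 1: import pandas with its usual alias

# Step 2: create a small DataFrame from scratch (made-up values)
df = pd.DataFrame({'points': [87, 87, 90, 88, 91, 85, 89],
                   'price': [15.0, 14.0, 30.0, 22.0, 65.0, 12.0, 19.0]})

# Write it to a throwaway csv so the example does not depend on any course file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')
    df.to_csv(path, index=False)
    # Step 3: read the csv back into a DataFrame
    loaded = pd.read_csv(path)

# Step 4: inspect the result
print(loaded.shape)         # (7, 2)
print(len(loaded.head()))   # 5, because head() returns the first 5 rows by default
print(len(loaded.head(3)))  # 3
```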

# Preliminary Setup: Working with Data Files
>- Check your working directory and move files if needed
>>- Your working directory is the folder where your Python notebooks or scripts are created and saved; in other words, where you open and save files. In this class, that is the folder holding your Jupyter or Colab notebooks.
>- If working in Google Colab...
>>- Mount Your Google Drive
```
from google.colab import drive
drive.mount('/content/drive/')
```
>>- Complete the code-along notebook, [Reading files into Colab](https://colab.research.google.com/drive/1XehT1fL7hH1oBn_foeyMXwg2Bw-grYaj?usp=sharing), for more detail on working with Google Colab and Drive
>>- Save csv and Excel files to the same folder/directory where you save your Colab notebooks
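
Checking the working directory needs only the standard-library `os` module. The sketch below is one way to do it; `wine.csv` is the file named in this lesson, and whether it is found depends on your own folder. The variable names are only for illustration.

```python
import os

# The working directory is where relative file names such as 'wine.csv' are looked up
cwd = os.getcwd()
print(cwd)

# List the directory contents and check whether the lesson file is present
files = os.listdir(cwd)
has_wine = 'wine.csv' in files
print('wine.csv found:', has_wine)
```

If the file is not found, either move it into this folder or change the working directory with `os.chdir()`, as the cells below do.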

In [2]:
import os
import pandas as pd

from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
os.chdir('/content/drive/MyDrive/Files_for_pandas/')

In [4]:
os.listdir()

['winemag-data-130k-v2.csv',
 'wine.csv',
 'students.csv',
 'students100.xlsx',
 'GBvideos.csv',
 'CAvideos.csv',
 'DJIA.csv',
 'AAPL.csv',
 'vendor.csv',
 'customer.csv',
 'invoice.csv',
 'line.csv',
 'product.csv',
 'q2_3chart.png',
 'q2_4chart.png',
 '3.2Annual_ETF_Returns.png',
 '3.4BoxPlot.png',
 '3.6regplot.png',
 '3.7pairwisePlot.png',
 'lesson18_4.1_scatter.png',
 'Ecommerce.csv',
 'MovieSurvey.csv',
 'Untitled Diagram.drawio',
 'Smarket.csv',
 'Caravan.csv']

# Step 1: Import pandas and give it an alias
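
A minimal sketch of this step follows. Printing the version is not part of the lesson; it is just an optional check that the import worked.

```python
import pandas as pd  # 'pd' is the conventional alias for pandas

# Optional sanity check: show which pandas version was imported
print(pd.__version__)
```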

# Step 2: Create a pandas `DataFrame`
## Definition: a `DataFrame` is a table
>- A `DataFrame` is essentially the same as an Excel table or a table in a SQL database
>- A `DataFrame` contains rows/records and columns

### Let's make a `DataFrame` in the next cell with the `pd.DataFrame()` constructor

In [7]:
pd.DataFrame({'Yes': [50, 30, 21],
              'No': [131, 2, 59]})

|  | Yes | No |
|---|---|---|
| 0 | 50 | 131 |
| 1 | 30 | 2 |
| 2 | 21 | 59 |


### Notes on the previous example:
1. We use the `pd.DataFrame()` constructor to create a DataFrame from scratch
2. We passed it a dictionary: the keys become the column names and each value is the list of entries for that column ('Yes' or 'No')
3. The numbers in the far left column are autogenerated index values
>- These values label every row/record in the DataFrame
>- We can specify our own index values with the `index` parameter after the dictionary
4. This is the most common way of constructing a DataFrame from scratch
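
Notes 2 and 3 can be checked directly. The sketch below rebuilds the same small DataFrame and looks at its `columns` and `index` attributes; the variable name `votes` is only for illustration.

```python
import pandas as pd

votes = pd.DataFrame({'Yes': [50, 30, 21],
                      'No': [131, 2, 59]})

# The dictionary keys became the column names
print(list(votes.columns))   # ['Yes', 'No']

# With no index given, pandas numbers the rows starting at 0
print(list(votes.index))     # [0, 1, 2]
```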

### Make another `DataFrame` with string data
>- Suppose we are collecting feedback on several products
>- We can store the data from various customers/reviewers with a DataFrame

In [10]:
pd.DataFrame({'Bob': ['I Liked it.', 'It was awful'],
              'Sue': ['Pretty good', 'Bland']
              }
             )

|  | Bob | Sue |
|---|---|---|
| 0 | I Liked it. | Pretty good |
| 1 | It was awful | Bland |


### Now add our own index values instead of the auto-generated numbers

In [11]:
pd.DataFrame({'Bob': ['I Liked it.', 'It was awful'],
              'Sue': ['Pretty good', 'Bland']
              },
             index= ['Product A', 'Product B']
             )

|  | Bob | Sue |
|---|---|---|
| Product A | I Liked it. | Pretty good |
| Product B | It was awful | Bland |


# Step 2 (part b) with `Series`
## Definition: a `Series` is a sequence of data values
>- Essentially, a `Series` can be thought of as a single column of a `DataFrame`
>>- And a `DataFrame` can be thought of as a bunch of `Series` placed side by side
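
The relationship between the two can be seen in a short sketch. It reuses the product-feedback data from earlier; selecting a column with square brackets is something later lessons cover in detail, and the variable names here are only for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({'Bob': ['I Liked it.', 'It was awful'],
                        'Sue': ['Pretty good', 'Bland']},
                       index=['Product A', 'Product B'])

# Selecting one column returns a Series that shares the DataFrame's index
bob = reviews['Bob']
print(type(bob).__name__)    # Series
print(list(bob.index))       # ['Product A', 'Product B']

# Going the other way: Series placed side by side form a DataFrame
rebuilt = pd.DataFrame({'Bob': reviews['Bob'], 'Sue': reviews['Sue']})
print(rebuilt.equals(reviews))   # True
```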

### Let's make a `Series` or two in the next few cells

In [12]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [13]:
pd.Series([1, 2, 3], index= ['2015 sales', '2016 sales', '2017 sales'])

2015 sales    1
2016 sales    2
2017 sales    3
dtype: int64

# Step 2 (part c) Read Data Into a DataFrame
>- Knowing how to create your own data can be useful
>- However, most of the time we will read data into a DataFrame from a csv or Excel file

## File Needed: `wine.csv`
>- Make sure you download this file from Canvas and place it in your working directory
>- **Updated Google Colab Notes**
>>- Please see the video on working with Google Colab and loading files from Google Drive to supplement this tutorial on working with local files in Jupyter Notebook.

### Read the csv file with `pd.read_csv('fileName.csv')`

In [14]:
wineReviews = pd.read_csv('winemag-data-130k-v2.csv')

### Check how many rows/records and columns are in the `wineReviews` DataFrame
>- Use `shape`

In [15]:
wineReviews.shape

(129971, 14)

### The output returned by `shape` tells us how many rows and columns are in our DataFrame
>- Number of rows: 129,971
>- Number of columns: 14
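
Because `shape` is a plain `(rows, columns)` tuple, it can be unpacked into two variables. A minimal sketch on a small DataFrame, so it runs without the wine file; the names `df`, `n_rows`, and `n_cols` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Yes': [50, 30, 21],
                   'No': [131, 2, 59]})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)   # 3 2
```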

### Now view the first rows of data with `head()`
>- `head()` returns the first 5 rows by default; here we pass `2` to show only the first 2

In [19]:
wineReviews.head(2)

|  | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |


### Notice how it looks like we have two index columns
>- This is because the csv file already had an index column, but pandas did not automatically treat it as the index, so it was read in as a regular column named `Unnamed: 0`
>- Similar to how we set the index in the DataFrames we created, we can set the `index_col` parameter when we read in data
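
The effect of `index_col` can be reproduced without the course file. The sketch below reads a tiny csv held in a string through `io.StringIO`; the text in `csv_text` is a two-row stand-in for the real file, and the variable names are only for illustration.

```python
import io

import pandas as pd

# A tiny stand-in for a csv whose first column is an unlabeled index
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# Default read: the unlabeled column shows up as 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))    # ['Unnamed: 0', 'country', 'points']

# index_col=0 tells pandas to use the first column as the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))  # ['country', 'points']
print(list(indexed.index))    # [0, 1]
```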

In [20]:
wineReviews = pd.read_csv('wine.csv', index_col=0)

In [21]:
wineReviews.head(2)

|  | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
