# Week 3 - Introduction to Pandas

## Installing and importing a Python package/library

- Installing Pandas Library for Data Processing 
- What is Pandas? 
    - Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
    - Its main data format is the DataFrame, which is like a 2D array, but with column names and indexes for the rows.
    - It is essentially the Swiss Army knife for your data. Using pandas, you can better understand your data by cleaning, transforming, and analyzing it.
    
    - For example, say you want to explore a dataset stored in a CSV file on your computer. Pandas will load the data from that CSV into a DataFrame (basically a table) and then let you do things like:
      - Calculate statistics and answer questions about the data, such as:
        - What's the average, median, max, or min of each column?
        - Does column A correlate with column B?
        - What does the distribution of data in column C look like?
      - Clean the data by removing missing values and filtering rows or columns by some criteria
      - Store the cleaned, transformed data back into a CSV, another file format, or a database
      
- Before you start visualizing data, you need a good understanding of the nature of your dataset, and pandas is well suited to building that understanding.
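
The tasks listed above can be sketched in a few lines. This is only an illustration under assumptions: the column names `A`, `B`, `C` and the values in `csv_text` are made up, and the CSV is read from a string rather than a real file so the example runs anywhere.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """A,B,C
1,2.0,10
2,4.1,12
3,,11
4,8.2,15
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

# Summary statistics (count, mean, min, max, quartiles) for each numeric column.
summary = df_demo.describe()

# Pearson correlation between columns A and B (rows with a missing value are skipped).
corr_ab = df_demo["A"].corr(df_demo["B"])

# A rough look at the distribution of column C.
c_distribution = df_demo["C"].describe()

# Clean: drop rows with missing values, then keep only rows where C is above 10.
cleaned = df_demo.dropna()
cleaned = cleaned[cleaned["C"] > 10]

# Store the cleaned data as CSV text (pass a file path to to_csv to write to disk instead).
cleaned_csv = cleaned.to_csv(index=False)
print(cleaned_csv)
```

With a real dataset, `pd.read_csv` would be given a file path instead of the `io.StringIO` wrapper.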




### Using pip to install Pandas 

In [12]:
# This only needs to be run once per environment; there is no need to rerun it in every project.

import sys
!{sys.executable} -m pip install pandas



### Importing the pandas library

In [2]:
# Import the pandas library as pd, so we can write the shorter pd instead of pandas everywhere
import pandas as pd
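
One quick way to confirm that the install and import worked is to print the library version. This is a small optional check, not part of the original notebook.

```python
import pandas as pd

# If this import succeeds, pandas is installed in the environment the notebook is using.
# The version string shows which release was picked up.
print(pd.__version__)
```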

## Creating DataFrame from Scratch 

There are many ways to create a DataFrame from scratch, but to start with, let's use a simple dictionary.

Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

This example DataFrame is adapted from https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

A dictionary is a collection that is ordered (it keeps insertion order as of Python 3.7), changeable, and does not allow duplicate keys.

For more on dictionaries, arrays, and lists, see https://python.plainenglish.io/arrays-vs-list-vs-dictionaries-47058fa19d4e
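
The three dictionary properties mentioned above can be checked directly in plain Python. The `stock` and `repeated` names here are hypothetical and used only for illustration.

```python
# A small dictionary, loosely modelled on the fruit stand idea.
stock = {"apples": 3, "oranges": 0}

# Changeable: existing values can be updated and new keys added.
stock["apples"] = 5
stock["pears"] = 2

# Ordered: keys come back in insertion order (guaranteed since Python 3.7).
keys_in_order = list(stock)
print(keys_in_order)

# No duplicate keys: repeating a key in a literal keeps only the last value.
repeated = {"apples": 1, "apples": 9}
print(repeated)
```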

In [13]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

In [14]:
data

{'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}

And then pass it to the pandas DataFrame constructor:

In [4]:
purchases = pd.DataFrame(data)

purchases

,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


How did that work?
Each (key, value) item in data corresponds to a column in the resulting DataFrame.
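
To make the key-to-column mapping concrete, here is a minimal sketch that rebuilds the same DataFrame and inspects it. The variable names `column_names`, `apples`, `n_rows`, and `n_cols` are only for illustration.

```python
import pandas as pd

data = {
    "apples": [3, 2, 0, 1],
    "oranges": [0, 3, 7, 2],
}
purchases = pd.DataFrame(data)

# The dictionary keys become the column labels.
column_names = list(purchases.columns)

# Each list of values becomes one column, which can be selected by its key.
apples = purchases["apples"].tolist()

# Every list has four entries, so the DataFrame has four rows and two columns.
n_rows, n_cols = purchases.shape
print(column_names, apples, n_rows, n_cols)
```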

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let's have customer names as our index:

In [5]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


So now we could locate a customer's order by using their name:

In [9]:
purchases.loc['June']

apples     3
oranges    0
Name: June, dtype: int64
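
Label-based lookup with `.loc` also works for single cells and for several rows at once. The sketch below rebuilds the same `purchases` DataFrame so it can run on its own; `.iloc` is included as the position-based counterpart.

```python
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

# A single cell: row label first, then column label.
june_apples = purchases.loc["June", "apples"]

# Several rows at once: pass a list of labels to get a smaller DataFrame back.
two_customers = purchases.loc[["June", "Lily"]]

# Position-based lookup uses .iloc instead; position 0 is June's order.
first_row = purchases.iloc[0]
print(june_apples)
print(two_customers)
```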

## Loading and exploring data with Pandas



First we load in the data. This particular dataset is from [IMDB (Internet Movie Database)](https://www.kaggle.com/datasets/mustafacicek/imdb-top-250-lists-1996-2020?resource=download) and is in the **comma-separated values (.csv)** format.

```python
df = pd.read_csv("data/imdbTop250.csv")
```

When a large DataFrame is displayed, pandas shows the **head** (first rows) and the **tail** (last rows), as well as the **shape**. So we know there are 16 columns (separate pieces of information about each film) and 6500 rows, one per film entry.
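
The head, tail, and shape can also be requested explicitly. The sketch below uses a small made-up DataFrame called `films` as a stand-in, since the CSV file may not be available; with the real data, the same calls would be made on `df`.

```python
import pandas as pd

# Stand-in data (hypothetical values); with the real file this would be
# df = pd.read_csv("data/imdbTop250.csv") as in the text.
films = pd.DataFrame({
    "Ranking": range(1, 21),
    "Rating": [9.0 - 0.05 * i for i in range(20)],
})

first_rows = films.head()      # first 5 rows by default
last_rows = films.tail(3)      # last 3 rows
n_rows, n_cols = films.shape   # (number of rows, number of columns)
print(first_rows)
print(last_rows)
print(n_rows, n_cols)
```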


In [15]:
# Load the IMDB dataset
# Uncomment the next line to display up to 100 rows:
# pd.options.display.max_rows = 100
df = pd.read_csv("data/imdbTop250.csv")

In [16]:
df

,Ranking,IMDByear,IMDBlink,Title,Date,RunTime,Genre,Rating,Score,Votes,Gross,Director,Cast1,Cast2,Cast3,Cast4
0,1,1996,/title/tt0076759/,Star Wars: Episode IV - A New Hope,1977,121,"Action, Adventure, Fantasy",8.6,90.0,1299781,322.74,George Lucas,Mark Hamill,Harrison Ford,Carrie Fisher,Alec Guinness
1,2,1996,/title/tt0111161/,The Shawshank Redemption,1994,142,Drama,9.3,80.0,2529673,28.34,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler
2,3,1996,/title/tt0117951/,Trainspotting,1996,93,Drama,8.1,83.0,665213,16.50,Danny Boyle,Ewan McGregor,Ewen Bremner,Jonny Lee Miller,Kevin McKidd
3,4,1996,/title/tt0114814/,The Usual Suspects,1995,106,"Crime, Drama, Mystery",8.5,77.0,1045626,23.34,Bryan Singer,Kevin Spacey,Gabriel Byrne,Chazz Palminteri,Stephen Baldwin
4,5,1996,/title/tt0108598/,The Wrong Trousers,1993,30,"Animation, Short, Comedy",8.3,,53316,,Nick Park,Peter Sallis,Peter Hawkins,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6495,246,2021,/title/tt0058946/,The Battle of Algiers,1966,121,"Drama, War",8.1,96.0,57995,0.06,Gillo Pontecorvo,Brahim Hadjadj,Jean Martin,Yacef Saadi,Samia Kerbash
6496,247,2021,/title/tt0050783/,Nights of Cabiria,1957,110,Drama,8.1,,47318,0.75,Federico Fellini,Giulietta Masina,François Périer,Franca Marzi,Dorian Gray
6497,248,2021,/title/tt0093779/,The Princess Bride,1987,98,"Adventure, Family, Fantasy",8.1,77.0,416207,30.86,Rob Reiner,Cary Elwes,Mandy Patinkin,Robin Wright,Chris Sarandon
6498,249,2021,/title/tt7060344/,Raatchasan,2018,170,"Crime, Drama, Mystery",8.4,,37474,,Ram Kumar,Vishnu Vishal,Amala Paul,Radha Ravi,Sangili Murugan
