## What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

## Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

**Data Science is a branch of computer science in which we study how to store, use, and analyze data to derive information from it.**

### Import Pandas
Import Pandas into your applications with the import keyword.

Pandas is usually imported under the pd alias:

In [1]:
import pandas as pd

# Create a pandas DataFrame from a dictionary.

mydataset = {
  'cars': ["BMW", "Volvo", "Ford", "Benz", "Audi"],
  'passings': [3, 7, 2, 5, 1]
}

myvar_0 = pd.DataFrame(mydataset)

myvar_0

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
3   Benz         5
4   Audi         1


### Checking Pandas Version
The version string is stored in the __version__ attribute.

In [None]:
print(pd.__version__)

1.3.5


### Pandas Series

#### What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [3]:
#Create a simple Pandas Series from a list:

a = [1, 7, 2]

myvar_1 = pd.Series(a)

myvar_1

0    1
1    7
2    2
dtype: int64

In [4]:
#Create a pandas dataframe from a dict

pd_dict = {'clubs': ["Arsenal", "Barcelona", "Chelsea", "Dortmund", "Everton"],
           # 'price_in_millions': [250, ]  # incomplete: needs one value per club
           }

pd_df = pd.DataFrame(pd_dict, index = ["A","B","C","D","E"])

pd_df

       clubs
A    Arsenal
B  Barcelona
C    Chelsea
D   Dortmund
E    Everton


In [4]:
my_dict = {"A": "Arsenal", "B": "Bayern", "C": "Chelsea"}

pd_series1 = pd.Series(my_dict)

pd_series1

A    Arsenal
B     Bayern
C    Chelsea
dtype: object

### Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [None]:
#Return the rows at positions 2 to 4 of the DataFrame:

pd_df.iloc[2:5]

      clubs
C   Chelsea
D  Dortmund
E   Everton
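
As a minimal sketch of default labels (the variable name `values` is hypothetical and not part of the notebook), a Series built without an index argument gets the labels 0, 1, 2, and label 0 returns the first value:

```python
import pandas as pd

# Hypothetical example data; no index argument, so labels default to 0, 1, 2.
values = pd.Series([1, 7, 2])

first = values[0]              # look up the value stored under label 0
labels = list(values.index)    # the automatically assigned labels

print(first)
print(labels)
```

Here the label and the position happen to coincide, which is only true while the default index is in use.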


### Create Labels
With the index argument, you can name your own labels.

In [None]:
a = [1, 7, 2,4,5, 1, 0, 6]

myvar_2 = pd.Series(a, index = ["s", "t", "u","v", "w", "x", "y", "Z"])

myvar_2

s    1
t    7
u    2
v    4
w    5
x    1
y    0
Z    6
dtype: int64

When you have created labels, you can access an item by referring to the label.

In [None]:
print(myvar_2["t"])

7
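
Once custom labels exist, label-based and position-based access can differ. The following sketch, which uses a shortened hypothetical Series `s`, contrasts `.loc` (labels) with `.iloc` (positions):

```python
import pandas as pd

# Shortened, hypothetical version of the labeled Series above.
s = pd.Series([1, 7, 2], index=["s", "t", "u"])

by_label = s.loc["t"]          # value stored under the label "t"
by_position = s.iloc[1]        # value at the second position
label_slice = s.loc["s":"t"]   # label slices include both endpoints

print(by_label, by_position)
print(label_slice.tolist())
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.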


### Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.



In [None]:
calories = {"day1": 420, "day2": 380, "day3": 390, "day4": 410}

myvar = pd.Series(calories, index = ["day1", "day2", "day3"])

myvar

day1    420
day2    380
day3    390
dtype: int64
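
One detail worth knowing about the index argument: a label that is not a key in the dictionary does not raise an error. In the sketch below, "day5" is a hypothetical label chosen for illustration; pandas fills it with NaN:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day5" is a hypothetical label that is not a key in the dictionary.
partial = pd.Series(calories, index=["day1", "day5"])

missing = partial.isna().tolist()   # True where no value was found

print(partial)
print(missing)
```

Because NaN is a floating-point value, the resulting Series holds floats rather than integers.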

#### What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

In [5]:
data = {
  "calories": [420, 380, 390, 410, 370],
  "duration": [50, 40, 45, 55, 40]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

df

   calories  duration
0       420        50
1       380        40
2       390        45
3       410        55
4       370        40
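
The relationship between a Series and a DataFrame can be checked directly. This sketch uses a shortened copy of the data above and assumes nothing beyond standard column selection with square brackets:

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

column = df["calories"]          # selecting one column yields a Series

print(type(column).__name__)     # Series
print(df.shape)                  # (rows, columns)
```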


### Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

In [21]:
#select rows 0 to 4 by index label and keep only the duration column:
df.loc[[0,1,2,3,4], ["duration"]]

   duration
0        50
1        40
2        45
3        55
4        40


Return rows 0 and 1:

Note: When a list of labels is passed inside [], the result is a Pandas DataFrame.

In [17]:
#use a list of indexes:
df.loc[[0, 1]]

   calories  duration
0       420        50
1       380        40
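
To illustrate the note about return types, the sketch below compares a single label with a one-element list of labels. The variable names are hypothetical and the data is a shortened copy of the DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

row_as_series = df.loc[0]      # single label: the row comes back as a Series
row_as_frame = df.loc[[0]]     # list of labels: the result stays a DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
```

The list form is handy when later code expects a DataFrame even if only one row is selected.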


### Named Indexes
With the index argument, you can name your own indexes.

In [None]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 
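
With named indexes in place, loc accepts those names instead of numbers. A minimal sketch, assuming the same data as the cell above (the variable names `day2` and `duration_day2` are hypothetical):

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

day2 = df.loc["day2"]                        # whole row as a Series
duration_day2 = df.loc["day2", "duration"]   # single cell: row label, column label

print(day2)
print(duration_day2)
```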

In [None]:
from google.colab import files
uploaded = files.upload()


Saving Movie Data Project.csv to Movie Data Project.csv


In [None]:
import io
df2 = pd.read_csv(io.BytesIO(uploaded['Movie Data Project.csv']))
# The dataset is now stored in a Pandas DataFrame

In [None]:
df2.head()

Unnamed: 0,Movie Title,Release Date,Genre (1),Director (1),Cast (1),Budget ($),Box Office Revenue ($),Profit
0,10 Cloverfield Lane,2016-03-08,Thriller,Dan Trachtenberg,Mary Elizabeth Winstead,15000000.0,108300000.0,93300000.0
1,13 Hours: The Secret Soldiers of Benghazi,2016-01-15,Action,Michael Bay,James Badge Dale,45000000.0,69400000.0,24400000.0
2,2 Guns,2013-08-02,Action,Baltasar Kormákur,Mark Wahlberg,61000000.0,131900000.0,70900000.0
3,21 Jump Street,2012-03-16,Comedy,Phil Lord,Jonah Hill,55000000.0,201500000.0,146500000.0
4,22 Jump Street,2014-06-04,Action,Phil Lord,Channing Tatum,84500000.0,331300000.0,246800000.0
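
The upload step above is specific to Google Colab. As a rough sketch of the same read_csv call outside Colab, the example below reads from an in-memory string instead of an uploaded file. The two rows are copied from the output above and trimmed to four columns; treating Profit as revenue minus budget is an assumption that happens to match the values shown:

```python
import io
import pandas as pd

# Two rows copied from the df2.head() output, trimmed to four columns.
csv_text = """Movie Title,Release Date,Budget ($),Box Office Revenue ($)
10 Cloverfield Lane,2016-03-08,15000000.0,108300000.0
2 Guns,2013-08-02,61000000.0,131900000.0
"""

# parse_dates converts the named column from text to datetime values.
movies = pd.read_csv(io.StringIO(csv_text), parse_dates=["Release Date"])

# Assumption: profit is revenue minus budget.
movies["Profit"] = movies["Box Office Revenue ($)"] - movies["Budget ($)"]

print(movies)
```

With a file on disk, the same call would take the file path in place of the `io.StringIO` object.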
