# <center>Introduction to Pandas</center>

![](https://pandas.pydata.org/_static/pandas_logo.png)


## Installation

Simply,
```
pip install pandas
```


## Reading data from a CSV file

You can read data from a CSV file using the ``read_csv`` function. By default, it assumes that the fields are comma-separated.

In [2]:
# import pandas
import pandas as pd

>The `imdb_1000.csv` dataset contains the highest-rated IMDb "Top 1000" titles.

In [3]:
# load imdb dataset as pandas dataframe
df = pd.read_csv('imdb_1000.csv')

In [15]:
# show first 5 rows of imdb_df
print(df.iloc[0:5])

   star_rating                     title content_rating   genre  duration  \
0          9.3  The Shawshank Redemption              R   Crime       142   
1          9.2             The Godfather              R   Crime       175   
2          9.1    The Godfather: Part II              R   Crime       200   
3          9.0           The Dark Knight          PG-13  Action       152   
4          8.9              Pulp Fiction              R   Crime       154   

                                         actors_list  
0  [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...  
1    [u'Marlon Brando', u'Al Pacino', u'James Caan']  
2  [u'Al Pacino', u'Robert De Niro', u'Robert Duv...  
3  [u'Christian Bale', u'Heath Ledger', u'Aaron E...  
4  [u'John Travolta', u'Uma Thurman', u'Samuel L....  


>The `bikes.csv` dataset contains information about the number of bicycles that used certain bicycle lanes in Montreal in the year 2012.

In [47]:
# load bikes dataset as pandas dataframe
df2 = pd.read_csv('bikes.csv')

In [46]:
# show first 3 rows of bikes_df
print(df2.iloc[0:3])

  Date;;Rachel / Papineau;Berri1;Maisonneuve_2;Maisonneuve_1;Brébeuf;Parc;PierDup;CSC (Côte Sainte-Catherine);Pont_Jacques_Cartier
0          01/01/2012;00:00;16;35;51;38;5;26;10;0;27                                                                              
1         02/01/2012;00:00;43;83;153;68;11;53;6;1;21                                                                              
2        03/01/2012;00:00;58;135;248;104;2;89;3;2;15                                                                              
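
The output above shows each row read as one long field, which suggests that `bikes.csv` is semicolon-delimited rather than comma-delimited. A minimal sketch of passing `sep=";"` to `read_csv`, using an in-memory sample copied from the rows above in place of the real file (the names `sample` and `bikes_df` are illustrative):

```python
import io

import pandas as pd

# Header and three rows copied from the output above; stands in for bikes.csv
sample = (
    "Date;;Rachel / Papineau;Berri1;Maisonneuve_2;Maisonneuve_1;Brébeuf;Parc;"
    "PierDup;CSC (Côte Sainte-Catherine);Pont_Jacques_Cartier\n"
    "01/01/2012;00:00;16;35;51;38;5;26;10;0;27\n"
    "02/01/2012;00:00;43;83;153;68;11;53;6;1;21\n"
    "03/01/2012;00:00;58;135;248;104;2;89;3;2;15\n"
)

# sep=";" tells read_csv to split fields on semicolons instead of commas
bikes_df = pd.read_csv(io.StringIO(sample), sep=";")

print(bikes_df.shape)
print(bikes_df.columns.tolist()[:4])
```

With the real file, the equivalent call would be `pd.read_csv('bikes.csv', sep=';')`. The empty header field between `Date` and `Rachel / Papineau` gets the automatic name `Unnamed: 1`.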


## Selecting columns

When you read a CSV, you get a kind of object called a DataFrame, which is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.

In [20]:
# list columns of imdb_df
columnList = df.columns.tolist()
print(columnList)


['star_rating', 'title', 'content_rating', 'genre', 'duration', 'actors_list']


In [22]:
# dtype of the column labels themselves (use df.dtypes for the dtype of the values in each column)
print(df.columns.dtype)

object


In [31]:
# list first 5 movie titles
titleList = df['title'].head().tolist()
print(titleList)

['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II', 'The Dark Knight', 'Pulp Fiction']


In [33]:
# show only movie title and genre
df[["title", "genre"]]

| | title | genre |
|---|---|---|
| 0 | The Shawshank Redemption | Crime |
| 1 | The Godfather | Crime |
| 2 | The Godfather: Part II | Crime |
| 3 | The Dark Knight | Action |
| 4 | Pulp Fiction | Crime |
| 5 | 12 Angry Men | Drama |
| 6 | The Good, the Bad and the Ugly | Western |
| 7 | The Lord of the Rings: The Return of the King | Adventure |
| 8 | Schindler's List | Biography |
| 9 | Fight Club | Drama |


## Understanding columns

Each column of a DataFrame is a ``pd.Series``, and a Series stores its data in an array, which for numeric columns is a numpy array. If you add ``.values`` to the end of any Series, you'll get its data as a **numpy array**.

In [35]:
# show the dtype of the duration column
print(df["duration"].dtype)

int64


In [39]:
# show duration values of movies as a numpy array
arr = df['duration'].values
arr

array([142, 175, 200, 152, 154,  96, 161, 201, 195, 139, 178, 148, 124,
       142, 179, 169, 133, 207, 146, 121, 136, 130, 130, 106, 127, 116,
       175, 118, 110,  87, 125, 112, 102, 107, 119,  87, 169, 115, 112,
       109, 189, 110, 150, 165, 155, 137, 113, 165,  95, 151, 155, 153,
       125, 130, 116,  89, 137, 117,  88, 165, 170,  89, 146,  99,  98,
       116, 156, 122, 149, 134, 122, 136, 157, 123, 119, 137, 128, 120,
       229, 107, 134, 103, 177, 129, 102, 216, 136,  93,  68, 189,  99,
       108, 113, 181, 103, 138, 110, 129,  88, 160, 126,  91, 116, 125,
       143,  93, 102, 132, 153, 183, 160, 120, 138, 140, 153, 170, 129,
        81, 127, 131, 172, 115, 108, 107, 129, 156,  96,  91,  95, 162,
       130,  86, 186, 151,  96, 170, 118, 161, 131, 126, 131, 129, 224,
       180, 105, 117, 140, 119, 124, 130, 139, 107, 132, 117, 126, 122,
       178, 238, 149, 172,  98, 116, 116, 123, 148, 123, 182,  92,  93,
       100, 135, 105,  94, 140,  83,  95,  98, 143, ...
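
As a small illustration of the relationship between a Series and its array, here is a sketch on a toy Series holding the first five durations from the dataset (the variable names are illustrative). It also shows that wrapping a Series in a list before calling `np.array` adds an extra dimension, which is usually not what you want:

```python
import numpy as np
import pandas as pd

# Toy Series with the first five movie durations shown earlier
duration = pd.Series([142, 175, 200, 152, 154], name="duration")

arr = duration.values        # data of the Series as a 1-D NumPy array
same = duration.to_numpy()   # method form that returns the same data

wrapped = np.array([duration])  # list wrapper produces a 2-D array of shape (1, 5)

print(type(arr), arr.shape)
print(wrapped.shape)
```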

## Applying functions to columns

Use the `.apply` method to apply any function to each element of a column.

In [42]:
# convert all the movie titles to uppercase with .apply, then print them
for title in df['title'].apply(str.upper):
    print(title)

THE SHAWSHANK REDEMPTION
THE GODFATHER
THE GODFATHER: PART II
THE DARK KNIGHT
PULP FICTION
12 ANGRY MEN
THE GOOD, THE BAD AND THE UGLY
THE LORD OF THE RINGS: THE RETURN OF THE KING
SCHINDLER'S LIST
FIGHT CLUB
THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING
INCEPTION
STAR WARS: EPISODE V - THE EMPIRE STRIKES BACK
FORREST GUMP
THE LORD OF THE RINGS: THE TWO TOWERS
INTERSTELLAR
ONE FLEW OVER THE CUCKOO'S NEST
SEVEN SAMURAI
GOODFELLAS
STAR WARS
THE MATRIX
CITY OF GOD
IT'S A WONDERFUL LIFE
THE USUAL SUSPECTS
SE7EN
LIFE IS BEAUTIFUL
ONCE UPON A TIME IN THE WEST
THE SILENCE OF THE LAMBS
LEON: THE PROFESSIONAL
CITY LIGHTS
SPIRITED AWAY
THE INTOUCHABLES
CASABLANCA
WHIPLASH
AMERICAN HISTORY X
MODERN TIMES
SAVING PRIVATE RYAN
RAIDERS OF THE LOST ARK
REAR WINDOW
PSYCHO
THE GREEN MILE
SUNSET BLVD.
THE PIANIST
THE DARK KNIGHT RISES
GLADIATOR
TERMINATOR 2: JUDGMENT DAY
MEMENTO
TAARE ZAMEEN PAR
DR. STRANGELOVE OR: HOW I LEARNED TO STOP WORRYING AND LOVE THE BOMB
THE DEPARTED
CINEMA PARADISO
APOCALYP
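
A short sketch of what `.apply` returns, using a toy Series of three titles from the dataset (the variable names are illustrative). Any callable can be passed, and for common string operations the `.str` accessor gives the same result without an explicit function:

```python
import pandas as pd

titles = pd.Series(["The Godfather", "Pulp Fiction", "Fight Club"], name="title")

upper = titles.apply(str.upper)        # call str.upper on each element
lengths = titles.apply(len)            # any callable works, here the built-in len
upper_via_str = titles.str.upper()     # string accessor, same values as apply(str.upper)

print(upper.tolist())
print(lengths.tolist())
```

The result of `.apply` is a new Series with the same index as the original, so it can be assigned back to a column.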

## Plotting a column

Use the ``.plot()`` method!

In [8]:
# plot the bikers travelling to Berri1 over the year
# (sep=";" because the file shown earlier is semicolon-delimited)
df2 = pd.read_csv("bikes.csv", sep=";")
df2["Berri1"].plot()


In [None]:
# plot all the columns of bikes_df
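
A minimal sketch of plotting one column versus all columns, assuming matplotlib is installed. The toy frame holds counts for two lanes copied from the first three rows shown earlier; the `Agg` backend and the explicit axes are choices made here so the example runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Counts for two lanes on the first three days of 2012 (day-first dates)
bikes = pd.DataFrame(
    {"Berri1": [35, 83, 135], "Maisonneuve_2": [51, 153, 248]},
    index=pd.to_datetime(["01/01/2012", "02/01/2012", "03/01/2012"], format="%d/%m/%Y"),
)

fig, (ax_one, ax_all) = plt.subplots(2, 1)
bikes["Berri1"].plot(ax=ax_one)   # a Series draws a single line
bikes.plot(ax=ax_all)             # a DataFrame draws one line per column
```

In a notebook the figure is rendered inline, and calling `.plot()` without `ax` is usually enough.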

## Value counts

Use ``.value_counts()`` to count how many times each unique value occurs in a particular column/Series.

In [55]:
# what are the unique genres in imdb_df?
df['genre'].unique()

array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',
       'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',
       'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

In [None]:
# plotting value counts of unique genres as a bar chart

In [None]:
# plotting value counts of unique genres as a pie chart
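
A sketch of counting values, using the genres of the first ten movies shown earlier as a toy Series (the variable names are illustrative):

```python
import pandas as pd

# Genres of the first ten movies in the dataset
genre = pd.Series(
    ["Crime", "Crime", "Crime", "Action", "Crime",
     "Drama", "Western", "Adventure", "Biography", "Drama"],
    name="genre",
)

counts = genre.value_counts()   # occurrences per unique value, most frequent first
print(counts.head(2))
print(genre.nunique())          # number of distinct genres
```

The resulting Series can be charted directly, for example with `counts.plot(kind="bar")` or `counts.plot(kind="pie")`, provided matplotlib is installed.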

## Index

### DATAFRAME = COLUMNS + INDEX + ND DATA

### SERIES = INDEX + 1-D DATA

The **Index** (the **row labels**) is one of the fundamental data structures of pandas. It can be thought of as an **immutable array** and an **ordered set**.

> Every row is identified by its index value. Index values are usually unique, although pandas does not enforce this.

In [57]:
# show index of bikes_df
df2.index


RangeIndex(start=0, stop=366, step=1)
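
The "immutable array" and "ordered set" descriptions can be checked on a small hand-made Index. This is a sketch with illustrative labels, not data from the files above:

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])
other = pd.Index(["b", "c", "d"])

# Immutable: assigning to an element raises TypeError
try:
    idx[0] = "z"
    mutated = True
except TypeError:
    mutated = False

# Set-like: labels support intersection and union
common = idx.intersection(other)
combined = idx.union(other)

print(mutated, common.tolist(), combined.tolist())
```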

In [None]:
# get row for date 2012-01-01
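
One possible approach, sketched on a toy frame with values from the first three rows of the bikes data: parse the `Date` column, make it the index, and look the row up by label with `.loc`. The day-first date format is an assumption based on the rows shown earlier:

```python
import pandas as pd

bikes = pd.DataFrame({
    "Date": ["01/01/2012", "02/01/2012", "03/01/2012"],
    "Berri1": [35, 83, 135],
})

# Parse the day-first date strings and use them as row labels
bikes["Date"] = pd.to_datetime(bikes["Date"], format="%d/%m/%Y")
bikes = bikes.set_index("Date")

# .loc selects by label; a Timestamp matches the datetime index exactly
row = bikes.loc[pd.Timestamp("2012-01-01")]
print(row["Berri1"])
```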

#### To get row by integer index:

Use ``.iloc[]`` for purely integer-location based indexing for selection by position.

In [58]:
# show the row at position 11 (the 12th row, since positions start at 0) of imdb_df using iloc
df.iloc[11]

star_rating                                                     8.8
title                                                     Inception
content_rating                                                PG-13
genre                                                        Action
duration                                                        148
actors_list       [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
Name: 11, dtype: object

## Selecting rows where column has a particular value

In [66]:
# select only those movies where genre is adventure
newdf = df[(df['genre'] == "Adventure")]
newdf

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 7 | 8.9 | The Lord of the Rings: The Return of the King | PG-13 | Adventure | 201 | [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... |
| 10 | 8.8 | The Lord of the Rings: The Fellowship of the Ring | PG-13 | Adventure | 178 | [u'Elijah Wood', u'Ian McKellen', u'Orlando Bl... |
| 14 | 8.8 | The Lord of the Rings: The Two Towers | PG-13 | Adventure | 179 | [u'Elijah Wood', u'Ian McKellen', u'Viggo Mort... |
| 15 | 8.7 | Interstellar | PG-13 | Adventure | 169 | [u'Matthew McConaughey', u'Anne Hathaway', u'J... |
| 54 | 8.5 | Back to the Future | PG | Adventure | 116 | [u'Michael J. Fox', u'Christopher Lloyd', u'Le... |
| 68 | 8.4 | Das Boot | R | Adventure | 149 | [u'J\xfcrgen Prochnow', u'Herbert Gr\xf6nemeye... |
| 71 | 8.4 | North by Northwest | APPROVED | Adventure | 136 | [u'Cary Grant', u'Eva Marie Saint', u'James Ma... |
| 85 | 8.4 | Lawrence of Arabia | PG | Adventure | 216 | [u"Peter O'Toole", u'Alec Guinness', u'Anthony... |
| 101 | 8.3 | Monty Python and the Holy Grail | PG | Adventure | 91 | [u'Graham Chapman', u'John Cleese', u'Eric Idle'] |
| 114 | 8.3 | Inglourious Basterds | R | Adventure | 153 | [u'Brad Pitt', u'Diane Kruger', u'Eli Roth'] |


In [67]:
# select movies with star rating above 8 and duration more than 130 minutes
# (counting the genres in this selection then shows which genre has the most such movies)
newdf = df[(df['star_rating']>8) & (df['duration']>130)]
newdf

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
| 6 | 8.9 | The Good, the Bad and the Ugly | NOT RATED | Western | 161 | [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ... |
| 7 | 8.9 | The Lord of the Rings: The Return of the King | PG-13 | Adventure | 201 | [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... |
| 8 | 8.9 | Schindler's List | R | Biography | 195 | [u'Liam Neeson', u'Ralph Fiennes', u'Ben Kings... |
| 9 | 8.9 | Fight Club | R | Drama | 139 | [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... |
| 10 | 8.8 | The Lord of the Rings: The Fellowship of the Ring | PG-13 | Adventure | 178 | [u'Elijah Wood', u'Ian McKellen', u'Orlando Bl... |
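
To go from the filtered rows to "which genre has the most such movies", one option is to count the genres in the selection. The sketch below uses a small toy frame whose values are illustrative rather than taken from the full dataset:

```python
import pandas as pd

# Toy data; ratings, genres and durations are illustrative
movies = pd.DataFrame({
    "star_rating": [9.3, 9.2, 9.0, 8.9, 8.9, 7.9],
    "genre": ["Crime", "Crime", "Action", "Drama", "Drama", "Drama"],
    "duration": [142, 175, 152, 96, 139, 150],
})

# Combine conditions with & and wrap each one in parentheses
mask = (movies["star_rating"] > 8) & (movies["duration"] > 130)
selected = movies[mask]

# Most frequent genre among the selected rows
top_genre = selected["genre"].value_counts().idxmax()
print(len(selected), top_genre)
```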


## Adding a new column to DataFrame

In [None]:
# add a weekday column to bikes_df
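
A possible sketch on a toy frame with the first three dates of the bikes data: once the dates are parsed, the `.dt` accessor gives the weekday, and assigning to a new column name adds the column. The day-first date format is an assumption based on the rows shown earlier:

```python
import pandas as pd

bikes = pd.DataFrame({
    "Date": pd.to_datetime(["01/01/2012", "02/01/2012", "03/01/2012"], format="%d/%m/%Y"),
    "Berri1": [35, 83, 135],
})

# Assigning to a new key creates the column
bikes["weekday"] = bikes["Date"].dt.day_name()
print(bikes["weekday"].tolist())
```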

## Deleting an existing column from DataFrame

In [None]:
# remove column 'Unnamed: 1' from bikes_df
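
A minimal sketch using `DataFrame.drop` on a toy frame that mimics the first columns of the bikes data:

```python
import pandas as pd

bikes = pd.DataFrame({
    "Date": ["01/01/2012", "02/01/2012"],
    "Unnamed: 1": ["00:00", "00:00"],
    "Berri1": [35, 83],
})

# drop returns a new DataFrame; the original is left unchanged
trimmed = bikes.drop(columns=["Unnamed: 1"])
print(trimmed.columns.tolist())
```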

## Deleting a row in DataFrame

In [None]:
# remove row no. 1 from bikes_df
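
A minimal sketch, again on a toy frame. Note that `drop` works on index labels, so with the default integer index "row no. 1" means the row labelled `1`:

```python
import pandas as pd

bikes = pd.DataFrame({"Berri1": [35, 83, 135]})

# Drop the row whose index label is 1; returns a new DataFrame
without_row = bikes.drop(index=1)
print(without_row.index.tolist())
```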

## Group By

A groupby operation involves one or more of the following steps on the original object:

- Splitting the Object

- Applying a function

- Combining the results

In many situations, we split the data into groups and apply a function to each group. In the apply step, we can perform the following operations:

- **Aggregation**: computing a summary statistic for each group

- **Transformation**: performing some group-specific operation that returns an object the same size as the input

- **Filtration**: discarding whole groups based on some condition
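
The three kinds of operation listed above can be sketched on a toy frame (the durations are illustrative):

```python
import pandas as pd

movies = pd.DataFrame({
    "genre": ["Crime", "Crime", "Drama", "Drama", "Action"],
    "duration": [142, 175, 96, 139, 152],
})
by_genre = movies.groupby("genre")["duration"]

# Aggregation: one value per group
mean_duration = by_genre.mean()

# Transformation: result has the same length as the input,
# here each row receives the mean duration of its genre
group_mean = by_genre.transform("mean")

# Filtration: keep only the groups whose mean duration exceeds 120
long_genres = movies.groupby("genre").filter(lambda g: g["duration"].mean() > 120)

print(mean_duration)
print(group_mean.tolist())
print(long_genres["genre"].tolist())
```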

In [69]:
# group imdb_df by movie genres
newdf = df.groupby('genre')
newdf

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001569F97A208>

In [6]:
# get crime movies group
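
One way to pull a single group out of a groupby object is `get_group`. A sketch on a toy frame built from the first five titles shown earlier:

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather", "The Godfather: Part II",
              "The Dark Knight", "Pulp Fiction"],
    "genre": ["Crime", "Crime", "Crime", "Action", "Crime"],
})

# get_group returns the rows of one group as a DataFrame
crime = movies.groupby("genre").get_group("Crime")
print(crime["title"].tolist())
```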

In [7]:
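# get mean star rating for each distinct duration
# (for the mean duration of each genre, use df.groupby('genre')['duration'].mean())
newdf = df.groupby('duration')
newdf.mean(numeric_only=True)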

| duration | star_rating |
|---|---|
| 64 | 8.000000 |
| 66 | 8.000000 |
| 67 | 8.100000 |
| 68 | 8.250000 |
| 69 | 7.600000 |
| 70 | 8.000000 |
| 75 | 7.650000 |
| 76 | 8.000000 |
| 78 | 7.633333 |
| 79 | 7.800000 |


In [None]:
# change duration of all movies in a particular genre to mean duration of the group

In [None]:
# drop groups/genres that do not have average movie duration greater than 120.

In [None]:
# group weekday wise bikers count

In [None]:
# get weekday wise biker count

In [None]:
# plot weekday wise biker count for 'Berri1'
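
A sketch of a weekday-wise count for `Berri1`, grouping by the weekday name derived from the date. The first three counts come from the rows shown earlier; the last three are made up so that each weekday appears twice:

```python
import pandas as pd

bikes = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01/01/2012", "02/01/2012", "03/01/2012",
         "08/01/2012", "09/01/2012", "10/01/2012"],
        format="%d/%m/%Y",
    ),
    # Last three values are hypothetical, for illustration only
    "Berri1": [35, 83, 135, 100, 200, 300],
})

# Group by weekday name and sum the counts within each weekday
weekday_totals = bikes.groupby(bikes["Date"].dt.day_name())["Berri1"].sum()
print(weekday_totals)
```

The result is a Series indexed by weekday name, so `weekday_totals.plot(kind="bar")` would chart it when matplotlib is available.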

![](https://memegenerator.net/img/instances/500x/73988569/pythonpandas-is-easy-import-and-go.jpg)