# <center>Introduction to Pandas</center>

![](https://pandas.pydata.org/_static/pandas_logo.png)


## Installation

Install pandas with pip:
```
pip install pandas
```


## Reading data from a CSV file

You can read data from a CSV file using the ``read_csv`` function. By default, it assumes that the fields are comma-separated; pass the ``sep`` argument to use a different delimiter.

In [None]:
# import pandas
import pandas as pd
import matplotlib.pyplot as plt

>The `imdb.csv` dataset contains the highest-rated IMDb "Top 1000" titles.

In [2]:
# load the IMDb dataset as a pandas DataFrame
df = pd.read_csv(r'C:\Users\prafu\Downloads\imdb_1000.csv')

In [5]:
# show the first 5 rows of the IMDb DataFrame (df)
result = df.head(5)
print(result)

   star_rating                     title content_rating   genre  duration  \
0          9.3  The Shawshank Redemption              R   Crime       142   
1          9.2             The Godfather              R   Crime       175   
2          9.1    The Godfather: Part II              R   Crime       200   
3          9.0           The Dark Knight          PG-13  Action       152   
4          8.9              Pulp Fiction              R   Crime       154   

                                         actors_list  
0  [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...  
1    [u'Marlon Brando', u'Al Pacino', u'James Caan']  
2  [u'Al Pacino', u'Robert De Niro', u'Robert Duv...  
3  [u'Christian Bale', u'Heath Ledger', u'Aaron E...  
4  [u'John Travolta', u'Uma Thurman', u'Samuel L....  


>The `bikes.csv` dataset contains information about the number of bicycles that used certain bicycle lanes in Montreal in the year 2012.

In [6]:
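# load the bikes dataset as a pandas DataFrame
# (the file is semicolon-delimited, so the default comma separator leaves each row in a single column)
df1 = pd.read_csv(r'C:\Users\prafu\Downloads\bikes.csv')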

In [7]:
# show the first 3 rows of the bikes DataFrame (df1)
result1 = df1.head(3)
print(result1)

  Date;;Rachel / Papineau;Berri1;Maisonneuve_2;Maisonneuve_1;Brébeuf;Parc;PierDup;CSC (Côte Sainte-Catherine);Pont_Jacques_Cartier
0          01/01/2012;00:00;16;35;51;38;5;26;10;0;27                                                                              
1         02/01/2012;00:00;43;83;153;68;11;53;6;1;21                                                                              
2        03/01/2012;00:00;58;135;248;104;2;89;3;2;15                                                                              
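
The single wide column in the output above suggests that the file uses `;` as its delimiter, so `read_csv` needs to be told about it. Below is a minimal sketch, assuming the header and the three rows printed above are representative of the file; the in-memory `io.StringIO` text stands in for the real `bikes.csv`, and the name `bikes_df` is only illustrative.

```python
import io

import pandas as pd

# Header and first three rows as printed above, standing in for bikes.csv.
raw = """Date;;Rachel / Papineau;Berri1;Maisonneuve_2;Maisonneuve_1;Brébeuf;Parc;PierDup;CSC (Côte Sainte-Catherine);Pont_Jacques_Cartier
01/01/2012;00:00;16;35;51;38;5;26;10;0;27
02/01/2012;00:00;43;83;153;68;11;53;6;1;21
03/01/2012;00:00;58;135;248;104;2;89;3;2;15
"""

# sep=';' makes read_csv split fields on semicolons instead of commas.
bikes_df = pd.read_csv(io.StringIO(raw), sep=";")

print(bikes_df.shape)
print(bikes_df.columns.tolist())
```

The second header field is empty, so pandas names that column `Unnamed: 1`. With the real file, the same argument would be passed as `pd.read_csv(path, sep=';')`.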


## Selecting columns

When you read a CSV, you get a kind of object called a DataFrame, which is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.

In [8]:
# list columns of imdb_df
print(df.columns)

Index(['star_rating', 'title', 'content_rating', 'genre', 'duration',
       'actors_list'],
      dtype='object')


In [9]:
# what are the datatypes of values in columns
df.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [16]:
# list first 5 movie titles
pj=df['title']
print(pj.head(5))

0    The Shawshank Redemption
1               The Godfather
2      The Godfather: Part II
3             The Dark Knight
4                Pulp Fiction
Name: title, dtype: object


In [13]:
# show only movie title and genre
result2 = df[['title', 'genre']]
print (result2)

                                               title      genre
0                           The Shawshank Redemption      Crime
1                                      The Godfather      Crime
2                             The Godfather: Part II      Crime
3                                    The Dark Knight     Action
4                                       Pulp Fiction      Crime
..                                               ...        ...
974                                          Tootsie     Comedy
975                      Back to the Future Part III  Adventure
976  Master and Commander: The Far Side of the World     Action
977                                      Poltergeist     Horror
978                                      Wall Street      Crime

[979 rows x 2 columns]


## Understanding columns

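Each column of a DataFrame is a ``pd.Series``, and a Series stores its values in an underlying NumPy array (for the common numeric dtypes). Adding ``.values`` to the end of any Series returns that **NumPy array**; the pandas documentation recommends the equivalent ``.to_numpy()`` for new code.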

In [19]:
# show the type of duration column
result3=df['duration']
type(result3)

pandas.core.series.Series

In [20]:
# show duration values of movies as numpy arrays
result3.values

array([142, 175, 200, 152, 154,  96, 161, 201, 195, 139, 178, 148, 124,
       142, 179, 169, 133, 207, 146, 121, 136, 130, 130, 106, 127, 116,
       175, 118, 110,  87, 125, 112, 102, 107, 119,  87, 169, 115, 112,
       109, 189, 110, 150, 165, 155, 137, 113, 165,  95, 151, 155, 153,
       125, 130, 116,  89, 137, 117,  88, 165, 170,  89, 146,  99,  98,
       116, 156, 122, 149, 134, 122, 136, 157, 123, 119, 137, 128, 120,
       229, 107, 134, 103, 177, 129, 102, 216, 136,  93,  68, 189,  99,
       108, 113, 181, 103, 138, 110, 129,  88, 160, 126,  91, 116, 125,
       143,  93, 102, 132, 153, 183, 160, 120, 138, 140, 153, 170, 129,
        81, 127, 131, 172, 115, 108, 107, 129, 156,  96,  91,  95, 162,
       130,  86, 186, 151,  96, 170, 118, 161, 131, 126, 131, 129, 224,
       180, 105, 117, 140, 119, 124, 130, 139, 107, 132, 117, 126, 122,
       178, 238, 149, 172,  98, 116, 116, 123, 148, 123, 182,  92,  93,
       100, 135, 105,  94, 140,  83,  95,  98, 143,  99,  98, ...

## Applying functions to columns

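Use the `.apply` method to apply any function to each element of a column. For common string operations, the vectorized `.str` accessor is usually simpler, and it is what the cell below uses.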

In [26]:
# convert all the movie titles to uppercase
pj=pj.str.upper()
print (pj)

0                             THE SHAWSHANK REDEMPTION
1                                        THE GODFATHER
2                               THE GODFATHER: PART II
3                                      THE DARK KNIGHT
4                                         PULP FICTION
                            ...                       
974                                            TOOTSIE
975                        BACK TO THE FUTURE PART III
976    MASTER AND COMMANDER: THE FAR SIDE OF THE WORLD
977                                        POLTERGEIST
978                                        WALL STREET
Name: title, Length: 979, dtype: object
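
For comparison, the same conversion can be sketched with `.apply`, which calls a Python function once per element. The three titles are taken from the output above; the variable names are only illustrative.

```python
import pandas as pd

titles = pd.Series(
    ["The Shawshank Redemption", "The Godfather", "Pulp Fiction"],
    name="title",
)

# .apply calls the given function on every element of the Series.
upper_apply = titles.apply(str.upper)

# The vectorized .str accessor gives the same values for string methods.
upper_str = titles.str.upper()

# Any callable works with .apply, for example len to get title lengths.
title_lengths = titles.apply(len)

print(upper_apply.tolist())
print(title_lengths.tolist())
```

For built-in string operations the `.str` form is usually the better choice; `.apply` is useful when the per-element logic has no vectorized equivalent.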


## Plotting a column

Use the ``.plot()`` method of a Series or DataFrame to plot its values.

In [None]:
# plot the bikers travelling to Berri1 over the year

In [29]:
# list the columns of bikes_df before plotting them
bik=df1.columns
print (bik)

Index(['Date;;Rachel / Papineau;Berri1;Maisonneuve_2;Maisonneuve_1;Brébeuf;Parc;PierDup;CSC (Côte Sainte-Catherine);Pont_Jacques_Cartier'], dtype='object')
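
Plotting a single column such as `Berri1` requires the file to be split into separate columns first (it is semicolon-delimited, so this assumes it was read with `sep=';'`). A minimal sketch on the three `Berri1` values shown earlier follows. The dates are assumed to be day-first (dd/mm/yyyy), and the `Agg` backend is selected only so the snippet runs without a display; inside a notebook that line is not needed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd

# Berri1 counts for the first three days, taken from the rows shown earlier.
# Assumption: 01/01/2012, 02/01/2012, 03/01/2012 are consecutive days.
berri = pd.Series(
    [35, 83, 135],
    index=pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]),
    name="Berri1",
)

# Series.plot() draws a line chart with the index on the x-axis
# and returns the matplotlib Axes it drew on.
ax = berri.plot()
ax.set_ylabel("bicycles per day")
```

Calling `.plot()` on a whole DataFrame draws one line per numeric column on the same axes.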


## Value counts

Use ``.value_counts()`` to count how often each unique value occurs in a particular column (Series).

In [None]:
# what are the unique genres in imdb_df?

In [None]:
# plotting value counts of unique genres as a bar chart

In [None]:
# plotting value counts of unique genres as a pie chart
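
One way to approach the cells above, sketched on the genres of the ten rows printed earlier rather than on the full dataset (the variable names are illustrative):

```python
import pandas as pd

# Genres of the ten rows printed earlier in this notebook.
genres = pd.Series(
    ["Crime", "Crime", "Crime", "Action", "Crime",
     "Comedy", "Adventure", "Action", "Horror", "Crime"],
    name="genre",
)

# Distinct values, in order of first appearance.
unique_genres = genres.unique()

# Number of occurrences of each value, most frequent first.
genre_counts = genres.value_counts()

print(list(unique_genres))
print(genre_counts)
```

Because `value_counts()` returns a Series, the counts can be charted directly with `genre_counts.plot(kind="bar")` or `genre_counts.plot(kind="pie")`.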

## Index

### DATAFRAME = COLUMNS + INDEX + 2-D DATA

### SERIES = INDEX + 1-D DATA

The **Index** (the **row labels**) is one of the fundamental data structures of pandas. It can be thought of as an **immutable array** or an **ordered set**.

> Every row is identified by its index value. Index labels are usually unique, but pandas does not require them to be.

In [None]:
# show index of bikes_df

In [None]:
# get row for date 2012-01-01
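
A possible sketch for the two cells above, assuming the bikes data has been split into columns and that the `Date` strings are day-first (dd/mm/yyyy). Only the `Date` and `Berri1` values shown earlier are used, and `bikes_df` is an illustrative name.

```python
import pandas as pd

# Three days of Berri1 counts, taken from the rows shown earlier.
bikes_df = pd.DataFrame(
    {
        "Date": ["01/01/2012", "02/01/2012", "03/01/2012"],
        "Berri1": [35, 83, 135],
    }
)

# Parse the date strings (assumed day-first) and use them as row labels.
bikes_df["Date"] = pd.to_datetime(bikes_df["Date"], format="%d/%m/%Y")
bikes_df = bikes_df.set_index("Date")
print(bikes_df.index)

# .loc selects rows by index label; here the label is a timestamp.
row = bikes_df.loc[pd.Timestamp("2012-01-01")]
print(row)
```

Setting the dates as the index is a design choice that makes label-based lookups like this one read naturally; the default integer index would require filtering on the `Date` column instead.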

#### To get a row by integer position:

Use ``.iloc[]`` for purely integer-location-based indexing, that is, selection by position (counting from 0).

In [None]:
# show 11th row of imdb_df using iloc
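
Since positions start at 0, the 11th row of the full dataset would be `.iloc[10]`. The sketch below shows the same idea on the five rows printed earlier, which is too small a sample to have an 11th row; `imdb_df` is an illustrative name.

```python
import pandas as pd

# Titles and durations of the first five rows shown earlier.
imdb_df = pd.DataFrame(
    {
        "title": [
            "The Shawshank Redemption",
            "The Godfather",
            "The Godfather: Part II",
            "The Dark Knight",
            "Pulp Fiction",
        ],
        "duration": [142, 175, 200, 152, 154],
    }
)

# .iloc[2] is the third row (positions are 0-based); it comes back as a Series.
third_row = imdb_df.iloc[2]

# Slices behave like Python list slices: the end position is excluded.
first_two = imdb_df.iloc[0:2]

print(third_row)
print(first_two)
```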

## Selecting rows where column has a particular value

In [None]:
# select only those movies where genre is adventure

In [None]:
# which genre has highest number of movies with star rating above 8 and duration more than 130 minutes?
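
Row selection of this kind is usually done with a boolean mask. The sketch below uses the five rows printed earlier; that sample contains no Adventure titles, so it filters on `'Action'` instead, and `genre == 'Adventure'` would work the same way on the full dataset. The name `imdb_df` is illustrative.

```python
import pandas as pd

# The first five rows shown earlier (actors_list omitted).
imdb_df = pd.DataFrame(
    {
        "star_rating": [9.3, 9.2, 9.1, 9.0, 8.9],
        "title": [
            "The Shawshank Redemption",
            "The Godfather",
            "The Godfather: Part II",
            "The Dark Knight",
            "Pulp Fiction",
        ],
        "genre": ["Crime", "Crime", "Crime", "Action", "Crime"],
        "duration": [142, 175, 200, 152, 154],
    }
)

# A comparison yields a boolean Series; indexing with it keeps the True rows.
action = imdb_df[imdb_df["genre"] == "Action"]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
mask = (imdb_df["star_rating"] > 8) & (imdb_df["duration"] > 130)

# Count genres among the matching rows and take the most frequent one.
top_genre = imdb_df.loc[mask, "genre"].value_counts().idxmax()

print(action)
print(top_genre)
```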

## Adding a new column to DataFrame

In [None]:
# add a weekday column to bikes_df
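
Assigning to a new column name adds that column. A minimal sketch, assuming the `Date` strings are day-first (dd/mm/yyyy) and using only the values shown earlier; `bikes_df` is an illustrative name.

```python
import pandas as pd

bikes_df = pd.DataFrame(
    {
        "Date": ["01/01/2012", "02/01/2012", "03/01/2012"],
        "Berri1": [35, 83, 135],
    }
)

# Convert the strings to datetimes so the .dt accessor becomes available.
bikes_df["Date"] = pd.to_datetime(bikes_df["Date"], format="%d/%m/%Y")

# Assigning to a column name that does not exist yet creates the column.
bikes_df["weekday"] = bikes_df["Date"].dt.day_name()

print(bikes_df)
```

If numbers are preferred over names, `.dt.weekday` gives 0 for Monday through 6 for Sunday.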

## Deleting an existing column from DataFrame

In [None]:
# remove column 'Unnamed: 1' from bikes_df
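
A column can be removed with `drop(columns=...)`. The sketch assumes the bikes data was parsed with `sep=';'`, in which case the empty second header field shows up as `Unnamed: 1`; the small frame below imitates that with values from the rows shown earlier.

```python
import pandas as pd

bikes_df = pd.DataFrame(
    {
        "Date": ["01/01/2012", "02/01/2012", "03/01/2012"],
        "Unnamed: 1": ["00:00", "00:00", "00:00"],
        "Berri1": [35, 83, 135],
    }
)

# drop returns a new DataFrame; the original is left unchanged
# unless the result is assigned back, as done here.
bikes_df = bikes_df.drop(columns=["Unnamed: 1"])

print(bikes_df.columns.tolist())
```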

## Deleting a row in DataFrame

In [None]:
# remove row no. 1 from bikes_df
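
Rows are removed with the same `drop` method, this time by index label. The sketch reads "row no. 1" as the row whose index label is 1, which is an assumption; the values come from the rows shown earlier.

```python
import pandas as pd

bikes_df = pd.DataFrame(
    {
        "Date": ["01/01/2012", "02/01/2012", "03/01/2012"],
        "Berri1": [35, 83, 135],
    }
)

# drop(index=...) removes rows by label, not by position.
bikes_df = bikes_df.drop(index=1)

print(bikes_df)
```

To drop by position instead, the label can be looked up first, for example `bikes_df.drop(index=bikes_df.index[1])`.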

## Group By

Any groupby operation involves one or more of the following steps on the original object:

- Splitting the Object

- Applying a function

- Combining the results

In many situations, we split the data into groups and apply some function to each group. In the apply step, we can perform the following operations:

- **Aggregation**: computing a summary statistic for each group

- **Transformation**: performing some group-specific operation

- **Filtration**: discarding groups that fail some condition

In [None]:
# group imdb_df by movie genres

In [None]:
# get crime movies group

In [None]:
# get mean of movie durations for each group

In [None]:
# change duration of all movies in a particular genre to mean duration of the group

In [None]:
# drop groups/genres that do not have average movie duration greater than 120.
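
The IMDb cells above map onto aggregation, transformation, and filtration. A minimal sketch on the five rows printed earlier follows; `imdb_df` is an illustrative name. The filter threshold is set to 160 only so that this small sample shows a group being dropped, whereas the exercise itself uses 120.

```python
import pandas as pd

# Genre and duration of the first five rows shown earlier.
imdb_df = pd.DataFrame(
    {
        "genre": ["Crime", "Crime", "Crime", "Action", "Crime"],
        "duration": [142, 175, 200, 152, 154],
    }
)

# Splitting: one group per distinct genre.
grouped = imdb_df.groupby("genre")

# Retrieve the rows of a single group.
crime = grouped.get_group("Crime")

# Aggregation: one summary value per group.
mean_duration = grouped["duration"].mean()

# Transformation: one value per original row (here, the mean of its group).
genre_mean = grouped["duration"].transform("mean")

# Filtration: keep only the groups for which the function returns True.
long_genres = grouped.filter(lambda g: g["duration"].mean() > 160)

print(mean_duration)
print(long_genres)
```

To overwrite the durations with the group means, the transformed Series can be assigned back, for example `imdb_df.assign(duration=genre_mean)`.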

In [None]:
# group weekday wise bikers count

In [None]:
# get weekday wise biker count

In [None]:
# plot weekday wise biker count for 'Berri1'

![](https://memegenerator.net/img/instances/500x/73988569/pythonpandas-is-easy-import-and-go.jpg)