# <center>Introduction to Pandas</center>

![](https://pandas.pydata.org/_static/pandas_logo.png)


## Installation

Simply,
```
pip install pandas
```


## Reading data from a CSV file

You can read data from a CSV file using the ``read_csv`` function. By default, it assumes that the fields are comma-separated.
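As a minimal sketch of the separator behaviour, the snippet below reads two tiny in-memory tables through `io.StringIO` instead of files on disk. The sample text and the names `comma_text`, `semi_text`, `comma_df` and `semi_df` are made up for illustration; only `read_csv` and its `sep` parameter come from pandas.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the datasets used in this notebook.
comma_text = "title,year\nAlpha,2001\nBeta,2002\n"
semi_text = "title;year\nAlpha;2001\nBeta;2002\n"

# With no extra arguments, read_csv splits fields on commas.
comma_df = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter has to be passed explicitly through `sep`.
semi_df = pd.read_csv(io.StringIO(semi_text), sep=";")

# Both calls should produce the same two-column table.
print(comma_df)
print(semi_df)
```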

In [1]:
# import pandas
import pandas as pd

>The `imdb_1000.csv` dataset contains the highest-rated IMDb "Top 1000" titles.

In [2]:
# load imdb dataset as pandas dataframe
df1 = pd.read_csv("../imdb_1000.csv")


In [None]:
# show first 5 rows of the imdb dataframe (df1)
df1.head()

>The `bikes.csv` dataset contains information about the number of bicycles that used certain bicycle lanes in Montreal in the year 2012.

In [None]:
# load bikes dataset as pandas dataframe

# dates are written day/month/year, so parse them with dayfirst=True
df2 = pd.read_csv("../bikes.csv", sep=';', parse_dates=['Date'], dayfirst=True)

In [3]:
# show first 3 rows of the bikes dataframe (df2)
df2.head(3)


## Selecting columns

When you read a CSV, you get a kind of object called a DataFrame, which is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.
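To make the dictionary analogy concrete, here is a small sketch on a toy DataFrame. The data and the names `toy`, `titles` and `subset` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "genre": ["Drama", "Crime"],
    "duration": [120, 95],
})

# A single key returns one column as a Series, like a dictionary lookup.
titles = toy["title"]

# A list of keys returns a smaller DataFrame with just those columns.
subset = toy[["title", "genre"]]

print(type(titles).__name__)
print(type(subset).__name__)
```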

In [None]:
# list columns of the imdb dataframe (df1)
df1.columns

In [None]:
# what are the datatypes of values in columns
df1.dtypes

In [4]:
# list first 5 movie titles
# .loc slices by label and includes the end label, so 0:4 gives 5 rows
df1.loc[0:4, ["title"]]


In [None]:
# show only movie title and genre
df1[['title','genre']]

## Understanding columns

Each column of a DataFrame is a ``pd.Series``, and a Series is typically backed by a numpy array. If you add ``.values`` to the end of any Series (or call ``.to_numpy()`` on it), you'll get its data as a **numpy array**.
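The relationship between a Series and its array can be sketched as follows. The Series `durations` and its values are invented for this example; `.values` and `.to_numpy()` are the pandas accessors being shown.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes, for illustration only.
durations = pd.Series([142, 175, 200], name="duration")

# Both accessors hand back the underlying data as a numpy array.
arr = durations.to_numpy()
same = durations.values

print(type(arr))

# Once it is a plain array, any numpy function can be applied to it.
print(np.sqrt(arr))
```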

In [None]:
# show the dtype of the duration column
df1['duration'].dtype

In [None]:
# show duration values of movies as numpy arrays
import numpy as np
s1 = np.array(df1['duration'])

print(s1)
print(type(s1))

## Applying functions to columns

Use the `.apply()` method to apply a function to each element of a column.

In [None]:
# convert all the movie titles to uppercase
to_uppercase = lambda x: x.upper()
df1['title'].apply(to_uppercase).head()

## Plotting a column

Use the ``.plot()`` method of a Series or DataFrame, or pass the columns to matplotlib's ``plt.plot()`` for more control.

In [None]:
df2.head()

In [None]:
# plot the bikers travelling to Berri1 over the year
import matplotlib.pyplot as plt
plt.plot(df2['Date'],df2['Berri1'])
plt.xlabel('Date')
plt.ylabel('Bikers travelling to Berri1')

plt.title("Bikers data")
plt.show()


In [4]:
# plot all the counter columns of df2

import matplotlib.pyplot as plt
# take the labels from the column names instead of typing them out by hand
counts = df2.loc[:, 'Rachel / Papineau' : 'Pont_Jacques_Cartier']
for col in counts.columns:
    plt.plot(df2['Date'], counts[col], label = col)

plt.xlabel('Date')
plt.ylabel('Bikers travelling to various places')
plt.title("Bikers data")
plt.legend()  # without this call the labels are never shown

plt.show()


## Value counts

``value_counts()`` counts how many times each unique value occurs in a column/Series.
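A small sketch of what this looks like on an in-memory Series. The genre labels below are made up; `value_counts()` and its `normalize` parameter are the pandas features being illustrated.

```python
import pandas as pd

# Hypothetical genre labels, for illustration only.
genres = pd.Series(["Crime", "Drama", "Crime", "Action", "Crime", "Drama"])

# One row per unique value, sorted from most to least frequent.
counts = genres.value_counts()
print(counts)

# normalize=True returns fractions of the total instead of raw counts.
shares = genres.value_counts(normalize=True)
print(shares)
```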

In [None]:
# what are the unique genres in the imdb dataframe (df1)?
genre_arr = pd.unique(df1['genre'])
genre_arr

In [None]:

# plotting value counts of unique genres as a bar chart
import numpy as np
genre_count = df1['genre'].value_counts()

# name the count column explicitly: its default name differs between pandas versions
df_temp = genre_count.rename('genre').to_frame()
n = len(df_temp.index)
left = np.arange(1, n + 1)  # x positions of the bars
height = list(df_temp['genre'])
tick_label = df_temp.index.values
plt.bar(left, height, tick_label = tick_label,
        width = 0.8, color = ['red', 'blue'])
plt.show()

In [None]:
# plotting value counts of unique genres as a pie chart
# read labels and sizes straight from the value_counts() result
activities = genre_count.index

slices = genre_count.values

plt.pie(slices, labels = activities, 
        startangle=90, shadow = True,
        radius = 1.2, autopct = '%1.1f%%')
plt.show()

## Index

### DATAFRAME = COLUMNS + INDEX + ND DATA

### SERIES = INDEX + 1-D DATA

The **Index** (the **row labels**) is one of the fundamental data structures of pandas. It can be thought of as an **immutable array** and an **ordered set**.

> Every row is identified by its index value. Index values are usually unique, although pandas also allows duplicates.

In [None]:
# show index of the bikes dataframe (df2)
df2.index.values
#df2.index

In [None]:
# get row for date 2012-01-01

filt = (df2['Date'] == '2012-01-01')
df2[filt]


#### To get row by integer index:

Use ``.iloc[]`` for purely integer-location based indexing for selection by position.
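The difference between selecting by label and selecting by position can be sketched on a toy DataFrame. The data, the string index and the names `toy`, `by_label` and `by_position` are hypothetical.

```python
import pandas as pd

# Hypothetical counts with a string index, for illustration only.
toy = pd.DataFrame({"riders": [35, 83, 135]}, index=["mon", "tue", "wed"])

# .loc looks a row up by its index label.
by_label = toy.loc["tue"]

# .iloc looks a row up by its integer position, starting at 0.
by_position = toy.iloc[1]

# Here both expressions point at the same row.
print(by_label)
print(by_position)
```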

In [None]:
# show 11th row of the imdb dataframe (df1) using iloc
df1.iloc[10]

## Selecting rows where column has a particular value

In [None]:
# select only those movies where genre is adventure
filt = (df1['genre'] == 'Adventure')
df1.loc[filt]


In [None]:
# which genre has highest number of movies with star rating above 8 and duration more than 130 minutes?
filt1 = (df1['star_rating'] > 8) & (df1['duration'] > 130)
ans = df1.loc[filt1,'genre'].value_counts()
# value_counts() sorts in descending order, so idxmax() returns the most frequent genre
ans.idxmax()

## Adding a new column to DataFrame

In [None]:
# add a weekday column to the bikes dataframe (df2)
df2['Weekday'] = df2['Date'].dt.day_name()
df2

## Deleting an existing column from DataFrame

In [None]:
# remove column 'Unnamed: 1' from the bikes dataframe (df2)
df2.drop( columns = ['Unnamed: 1'],inplace = True)
df2


## Deleting a row in DataFrame

In [None]:
# remove the first row (index label 0) from df2
# drop returns a new DataFrame; df2 itself is left unchanged
df2.drop([0])

## Group By

A groupby operation involves one or more of the following steps on the original object:

- Splitting the Object

- Applying a function

- Combining the results

In many situations, we split the data into sets and apply some function to each subset. In the apply step, we can perform the following operations:

- **Aggregation**: computing a summary statistic

- **Transformation**: performing some group-specific operation

- **Filtration**: discarding the data that fails some condition
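The three kinds of operation listed above can be sketched on a toy DataFrame. The data and the names `toy`, `grouped`, `mean_by_genre`, `centered` and `long_genres` are hypothetical; the 120-minute threshold is just an example value.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "genre": ["Crime", "Crime", "Drama", "Drama", "Action"],
    "duration": [150, 130, 100, 110, 125],
})
grouped = toy.groupby("genre")

# Aggregation: one summary value per group.
mean_by_genre = grouped["duration"].mean()

# Transformation: the result has one value per original row,
# here each duration minus the mean of its own genre.
centered = toy["duration"] - grouped["duration"].transform("mean")

# Filtration: keep only the rows of groups whose mean duration exceeds 120.
long_genres = grouped.filter(lambda g: g["duration"].mean() > 120)

print(mean_by_genre)
print(centered)
print(long_genres)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration keeps or drops whole groups.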

In [None]:
# group the imdb dataframe (df1) by movie genre
# grouping by a single column name lets get_group() take a plain string key
genre_grp = df1.groupby('genre')
genre_grp

In [None]:
# get crime movies group
genre_grp.get_group('Crime')

In [None]:
# get mean of movie durations for each group
duration_mean = genre_grp['duration'].mean()
duration_mean

In [None]:
# change duration of all movies in a particular genre to mean duration of the group
# transform('mean') returns each group's mean aligned to the original rows,
# so no explicit loop over the rows is needed
df1['duration'] = df1.groupby('genre')['duration'].transform('mean')

df1

In [None]:
# drop groups/genres that do not have average movie duration greater than 120.
# filter() keeps only the groups for which the function returns True
df1 = df1.groupby('genre').filter(lambda g: g['duration'].mean() > 120)

df1

In [None]:
# group weekday wise bikers count
weekday_grp = df2.groupby('Weekday')


In [None]:
# get weekday wise biker count
day_lst = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# sum only the numeric counter columns; the Date column cannot be summed
sum_lst = weekday_grp.sum(numeric_only=True).sum(axis=1).reindex(day_lst).tolist()

sum_lst

In [None]:
# plot weekday wise biker count for 'Berri1'
# select the column first, then sum per weekday and order the result Monday to Sunday
Berri1_sum = weekday_grp['Berri1'].sum().reindex(day_lst)

plt.xlabel('Weekday')
plt.ylabel('Bikers count')
plt.plot(day_lst,Berri1_sum)
plt.title('Berri1')
plt.show()

![](https://memegenerator.net/img/instances/500x/73988569/pythonpandas-is-easy-import-and-go.jpg)