
# Pandas

Pandas is an open-source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is used in academic and commercial domains such as finance, economics, statistics, and analytics. In this tutorial, we will learn the main features of pandas and how to use them in practice.


## Import pandas and numpy

In [0]:
import pandas as pd
import numpy as np

### This is your playground; feel free to explore other pandas functions

#### Create Series from numpy array, list and dict

Don't know what a series is?

[Series Doc](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.html)

In [2]:
a_ascii = ord('A')
z_ascii = ord('Z')
alphabets = [chr(i) for i in range(a_ascii, z_ascii+1)]

print(alphabets)

numbers = np.arange(26)

print(numbers)

print(type(alphabets), type(numbers))

alpha_numbers = dict(zip(alphabets, numbers))

print(alpha_numbers)

print(type(alpha_numbers))

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25]
<class 'list'> <class 'numpy.ndarray'>
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9, 'K': 10, 'L': 11, 'M': 12, 'N': 13, 'O': 14, 'P': 15, 'Q': 16, 'R': 17, 'S': 18, 'T': 19, 'U': 20, 'V': 21, 'W': 22, 'X': 23, 'Y': 24, 'Z': 25}
<class 'dict'>


In [3]:
series1 = pd.Series(alphabets)
print(series1)

0     A
1     B
2     C
3     D
4     E
5     F
6     G
7     H
8     I
9     J
10    K
11    L
12    M
13    N
14    O
15    P
16    Q
17    R
18    S
19    T
20    U
21    V
22    W
23    X
24    Y
25    Z
dtype: object


In [4]:
series2 = pd.Series(numbers)
print(series2)

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
dtype: int64


In [5]:
series3 = pd.Series(alpha_numbers)
print(series3)

A     0
B     1
C     2
D     3
E     4
F     5
G     6
H     7
I     8
J     9
K    10
L    11
M    12
N    13
O    14
P    15
Q    16
R    17
S    18
T    19
U    20
V    21
W    22
X    23
Y    24
Z    25
dtype: int64


In [6]:
# Change the argument of head(n), where n is any number in [0, 25], and observe the output in each case
series3.head(23)

A     0
B     1
C     2
D     3
E     4
F     5
G     6
H     7
I     8
J     9
K    10
L    11
M    12
N    13
O    14
P    15
Q    16
R    17
S    18
T    19
U    20
V    21
W    22
dtype: int64
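
`head(n)` has a counterpart, `tail(n)`, and a Series built from a dict can be read either by label or by position. The following is a minimal sketch that rebuilds the same letter-to-number mapping; the names `letters`, `first_three`, and `last_three` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the letter -> number mapping (the name `letters` is illustrative).
keys = [chr(i) for i in range(ord('A'), ord('Z') + 1)]
letters = pd.Series(dict(zip(keys, np.arange(26))))

first_three = letters.head(3)   # first 3 entries: A, B, C
last_three = letters.tail(3)    # last 3 entries: X, Y, Z

by_label = letters.loc['K']     # label-based lookup
by_position = letters.iloc[10]  # position-based lookup of the same element

print(first_three.index.tolist(), last_three.index.tolist())
print(by_label, by_position)
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of lookup explicit, which matters once the index is no longer a simple 0..n-1 range.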

#### Create DataFrame from lists

[DataFrame Doc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

In [7]:
data = {'alphabets': alphabets, 'values': numbers}

df = pd.DataFrame(data)

# Let's rename the column `values` to `alpha_numbers`

df.columns = ['alphabets', 'alpha_numbers']

df

Unnamed: 0,alphabets,alpha_numbers
0,A,0
1,B,1
2,C,2
3,D,3
4,E,4
5,F,5
6,G,6
7,H,7
8,I,8
9,J,9
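
Assigning a full list to `df.columns` requires restating every column name. As a hedged alternative, `DataFrame.rename` changes only the columns named in a mapping. The small frame below is a made-up stand-in for `df`, and `renamed` is an illustrative name.

```python
import pandas as pd

# Small stand-in for the DataFrame built above.
df = pd.DataFrame({'alphabets': ['A', 'B', 'C'], 'values': [0, 1, 2]})

# rename() maps old names to new ones and leaves other columns untouched.
# It returns a new DataFrame; the original keeps its column names.
renamed = df.rename(columns={'values': 'alpha_numbers'})

print(renamed.columns.tolist())
print(df.columns.tolist())
```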


In [8]:
# transpose

df.T

# Many more operations are available; see the documentation. The following exercises cover more of them.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
alphabets,A,B,C,D,E,F,G,H,I,J,...,Q,R,S,T,U,V,W,X,Y,Z
alpha_numbers,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25


#### Extract Items from a series

In [9]:
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

vowels = ser.take(pos)

df = pd.DataFrame(vowels)#, columns=['vowels'])

df.columns = ['vowels']

df.index = [0, 1, 2, 3, 4]

df

Unnamed: 0,vowels
0,a
1,e
2,i
3,o
4,u
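
The same extraction can be sketched with `iloc`, which also selects by position. Here `reset_index(drop=True)` replaces the manual index assignment and `to_frame` names the column in one step; `picked` and `vowels_df` are illustrative names.

```python
import pandas as pd

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

# iloc selects by position, like take(); the original labels (0, 4, 8, ...) are kept.
picked = ser.iloc[pos]

# reset_index(drop=True) renumbers from 0 and discards the old labels;
# to_frame() wraps the Series in a DataFrame and names its single column.
vowels_df = picked.reset_index(drop=True).to_frame(name='vowels')

print(vowels_df)
```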


#### Change the first character of each word in ser to upper case

In [10]:
ser = pd.Series(['we', 'are', 'learning', 'pandas'])

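# Series.map returns a new Series; the result is discarded here because it is not assigned
ser.map(lambda x: x.title())

# The same transformation as a plain Python list
titles = [i.title() for i in ser]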

titles

['We', 'Are', 'Learning', 'Pandas']
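
pandas also offers vectorized string methods through the `.str` accessor, which avoids both the lambda and the list comprehension. A minimal sketch, with `titled` as an illustrative name:

```python
import pandas as pd

ser = pd.Series(['we', 'are', 'learning', 'pandas'])

# The .str accessor applies string methods element-wise and returns a new Series.
titled = ser.str.title()

print(titled.tolist())
```

The original `ser` is left unchanged; assign the result back if the capitalized values should replace it.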

#### Reindexing

In [11]:
my_index = [1, 2, 3, 4, 5]

df1 = pd.DataFrame({'upper values': ['A', 'B', 'C', 'D', 'E'],
                   'lower values': ['a', 'b', 'c', 'd', 'e']},
                  index = my_index)

df1

Unnamed: 0,lower values,upper values
1,a,A
2,b,B
3,c,C
4,d,D
5,e,E


In [12]:
new_index = [2, 5, 4, 3, 1]

df1.reindex(index = new_index)

Unnamed: 0,lower values,upper values
2,b,B
5,e,E
4,d,D
3,c,C
1,a,A
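
`reindex` does more than reorder rows: a label that is absent from the original index produces a row of missing values. The sketch below uses a shortened copy of `df1`; the label `4` and the `'?'` placeholder are chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'upper values': ['A', 'B', 'C'],
                    'lower values': ['a', 'b', 'c']},
                   index=[1, 2, 3])

# Label 4 does not exist in df1, so its row is filled with NaN.
reindexed = df1.reindex(index=[3, 4, 1])

# fill_value supplies a placeholder instead of NaN.
filled = df1.reindex(index=[3, 4, 1], fill_value='?')

print(reindexed)
print(filled)
```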


# Get to know your Data


#### Import necessary modules


In [0]:
import pandas as pd
import numpy as np

#### Loading CSV Data to a DataFrame

In [0]:
iris_df = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')


#### See the top 10 rows


In [0]:
iris_df.head(10)  # head() with no argument returns only the first 5 rows

#### Find number of rows and columns


In [0]:
print(iris_df.shape)

# shape is a (rows, columns) tuple
# index into it to get each count separately

#print(iris_df.shape[0])
#print(iris_df.shape[1])

#### Print all columns

In [0]:
print(iris_df.columns)

#### Check Index


In [0]:
print(iris_df.index)

#### Right now the iris dataset has all the species grouped together; let's shuffle it

In [0]:
# generate a random permutation of the index

print(iris_df.head())

new_index = np.random.permutation(iris_df.index)
iris_df = iris_df.reindex(index = new_index)

print(iris_df.head())
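
A hedged alternative for shuffling is `DataFrame.sample` with `frac=1`, which draws every row once in random order. The frame `toy` below is a made-up stand-in for `iris_df`, and the seed value `0` is arbitrary.

```python
import pandas as pd

# Toy stand-in for iris_df; the values are made up for illustration.
toy = pd.DataFrame({'sepal_width': [3.5, 3.0, 3.2, 2.8, 3.4, 2.9],
                    'species': ['setosa', 'setosa', 'versicolor',
                                'versicolor', 'virginica', 'virginica']})

# frac=1 samples every row exactly once, i.e. a shuffle;
# random_state makes the order repeatable between runs.
shuffled = toy.sample(frac=1, random_state=0)

# Optionally renumber the rows after shuffling.
shuffled_renumbered = shuffled.reset_index(drop=True)

print(shuffled)
```

Fixing `random_state` is useful when a notebook should produce the same shuffled order every time it is rerun.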

#### We can also apply an operation to a whole column of iris_df

In [0]:
#original

print(iris_df.head())

iris_df['sepal_width'] *= 10

#changed

print(iris_df.head())

# let's undo the operation

iris_df['sepal_width'] /= 10

print(iris_df.head())

#### Show all the rows where sepal_width > 3.3

In [0]:
iris_df[iris_df['sepal_width']>3.3]

#### Club two filters together - Find all samples where sepal_width > 3.3 and species is versicolor

In [0]:
iris_df[(iris_df['sepal_width']>3.3) & (iris_df['species'] == 'versicolor')] 
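
Each comparison is wrapped in parentheses because `&` binds more tightly than `>` and `==` in Python. Naming the masks first sidesteps the issue and makes them reusable with `|` (or) and `~` (not). A minimal sketch on a made-up frame `toy`, with illustrative mask names:

```python
import pandas as pd

# Toy stand-in for iris_df; the values are made up for illustration.
toy = pd.DataFrame({'sepal_width': [3.5, 3.0, 3.4, 2.8, 3.6],
                    'species': ['setosa', 'setosa', 'versicolor',
                                'versicolor', 'virginica']})

wide = toy['sepal_width'] > 3.3
is_versicolor = toy['species'] == 'versicolor'

both = toy[wide & is_versicolor]     # rows meeting both conditions
either = toy[wide | is_versicolor]   # rows meeting at least one condition
not_wide = toy[~wide]                # rows failing the first condition

print(both)
```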

#### Sorting a column by value

In [0]:
iris_df.sort_values(by='sepal_width')#, ascending = False)
#pass ascending = False for descending order

#### List all the unique species

In [0]:
species = iris_df['species'].unique()

print(species)

#### Selecting a particular species using boolean mask (learnt in previous exercise)

In [0]:
# iris_df was shuffled above, so the order of `species` is not fixed; select by name
setosa = iris_df[iris_df['species'] == 'setosa']

setosa.head()

In [0]:
# do the same for the other 2 species, selecting by name rather than by position in `species`
versicolor = iris_df[iris_df['species'] == 'versicolor']

versicolor.head()

In [0]:


virginica = iris_df[iris_df['species'] == 'virginica']

virginica.head()

#### Describe each created species to see the difference



In [0]:
setosa.describe()

In [0]:
versicolor.describe()

In [0]:
virginica.describe()
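
The three separate `describe()` calls can also be sketched as a single `groupby`, which puts the per-species statistics side by side. The frame `toy` is a made-up stand-in for `iris_df`, and `summary` and `means` are illustrative names.

```python
import pandas as pd

# Toy stand-in for iris_df; the values are made up for illustration.
toy = pd.DataFrame({'sepal_length': [5.0, 5.2, 6.0, 6.4, 6.9, 7.1],
                    'species': ['setosa', 'setosa', 'versicolor',
                                'versicolor', 'virginica', 'virginica']})

# One row per species, with count/mean/std/min/quartiles/max as columns.
summary = toy.groupby('species')['sepal_length'].describe()

# Or compute a single statistic per species.
means = toy.groupby('species')['sepal_length'].mean()

print(summary)
print(means)
```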

#### Let's plot and see the difference

##### import matplotlib.pyplot 

In [0]:
import matplotlib.pyplot as plt

# hist creates a histogram; there are many more plot types (see the documentation) to experiment with.

plt.hist(setosa['sepal_length'])
plt.hist(versicolor['sepal_length'])
plt.hist(virginica['sepal_length'])

# Pandas Exercise


#### import necessary modules

In [0]:
import numpy as np
import pandas as pd

#### Load url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data" to a dataframe named wine_df

This is a wine dataset



#### print first five rows

#### assign wine_df to a different variable wine_df_copy and then delete all odd rows of wine_df_copy

[Hint](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)
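
One point worth keeping in mind here: plain assignment only creates a second name for the same DataFrame, while `copy()` creates an independent one. The sketch below shows the difference on a made-up frame `toy` (not the wine data); it assumes "odd rows" means the rows at positions 1, 3, 5, and so on.

```python
import pandas as pd

# Made-up frame for illustration only.
toy = pd.DataFrame({'x': [10, 11, 12, 13, 14]})

alias = toy                 # same object: changes through `alias` also affect `toy`
independent = toy.copy()    # separate object with its own data

# Drop the rows at odd positions (1, 3, ...) from the copy only.
odd_labels = independent.index[1::2]
independent = independent.drop(odd_labels)

print(alias is toy)
print(independent.index.tolist(), len(toy))
```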

#### Assign the columns as below:

The attributes are (donated by Riccardo Leardi, riclea '@' anchem.unige.it):  
1) alcohol  
2) malic_acid  
3) alcalinity_of_ash  
4) magnesium  
5) flavanoids  
6) proanthocyanins  
7) hue  

#### Set the first 3 values of the alcohol column to NaN

Hint- Use iloc to select 3 rows of wine_df

#### Create an array of 10 random numbers up to 10 and assign it to a variable named `random`

#### Use the random numbers you generated as an index and assign NaN to each corresponding cell of the column alcohol

#### How many missing values do we have?

Hint: you can use isnull() and sum()

#### Delete the rows that contain missing values 
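
As a hedged illustration of the pattern behind the last few steps, the sketch below works on a made-up two-column frame `toy` rather than the wine data, so the numbers it prints say nothing about the exercise's answer.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names echo the exercise.
toy = pd.DataFrame({'alcohol': [14.2, 13.2, 13.1, 14.3],
                    'hue': [1.04, 1.05, 1.03, 0.86]})

# Positional assignment: first two rows of the 'alcohol' column.
toy.iloc[:2, toy.columns.get_loc('alcohol')] = np.nan

missing_per_column = toy.isnull().sum()        # NaN count for each column
total_missing = int(missing_per_column.sum())  # NaN count for the whole frame

cleaned = toy.dropna()                         # drops every row holding a NaN

print(missing_per_column)
print(total_missing, len(cleaned))
```

`dropna()` returns a new DataFrame, so `toy` itself still contains the missing values unless the result is assigned back.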

### BONUS: Play with the data set below