## Day 9: Hands-on Pandas, Part 1

**REFERENCE:** https://medium.datadriveninvestor.com/day-9-60-days-of-data-science-and-machine-learning-2d0f75a498b9

### What is Pandas?
- Pandas is an open-source Python package for data manipulation, analysis, and ML tasks.
- It is built on top of NumPy, a package that provides support for mathematical computations and multi-dimensional arrays.

#### Importing Pandas Library
First, install the Pandas library on your system:
- *pip install pandas*

Then import it in your Jupyter Notebook with the command below.

In [1]:
### Importing Pandas Library as pd (an alias name given to Pandas Library)
import pandas as pd

### Pandas Series
A Pandas Series is a ***one-dimensional labeled array*** capable of holding data of any type (integer, string, float, Python objects, etc.). Displaying a Series shows both its values and the index labels associated with them.

In [2]:
series1 = pd.Series([123, 234, 345, 456, 567, 678, 789, 890, 901])
series1

0    123
1    234
2    345
3    456
4    567
5    678
6    789
7    890
8    901
dtype: int64

In [3]:
### To check the type
type(series1)

pandas.core.series.Series

#### **Series.axes**
Returns a list of row axis labels of the given Series object.

In [4]:
series1.axes

[RangeIndex(start=0, stop=9, step=1)]

#### Checking the DataType of the Series

In [5]:
series1.dtype

dtype('int64')

#### **Series.size**
The size attribute returns the number of elements in the underlying data of the given Series object.

In [6]:
series1.size

9

#### **Series.ndim**
Returns the number of dimensions of the underlying data. By definition, it is 1 for Series objects.

In [7]:
series1.ndim

1

#### **Series.values** 
Return Series as ndarray or ndarray-like depending on the dtype.

In [8]:
series1.values

array([123, 234, 345, 456, 567, 678, 789, 890, 901])

In [9]:
### We can also specify our Indexes in Strings/Objects.
series2 = pd.Series([1,2,4,5,6],index = ["First", "Zero", "Second", "Third", "Fourth"])
series2

First     1
Zero      2
Second    4
Third     5
Fourth    6
dtype: int64

> If the Series has a string-based index, calling sort_index() arranges its elements in alphabetical order of the index labels.

In [10]:
series2.sort_index()

First     1
Fourth    6
Second    4
Third     5
Zero      2
dtype: int64

### Creating Series with Dictionaries

In [11]:
ages = {'Andrew': 31, "Kate": 45, "Matthew": 26, "Helen": 19}           ### Keys act as Index
new_ages = pd.Series(ages)
new_ages

Andrew     31
Kate       45
Matthew    26
Helen      19
dtype: int64

- If we only want to select particular elements from the dictionary, we can pass their keys through the index argument.

In [12]:
ages = {'Andrew': 31, "Kate": 45, "Matthew": 26, "Helen": 19}
pd.Series(ages, index= ['Andrew', 'Helen'])

Andrew    31
Helen     19
dtype: int64
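
When the index argument names a label that is not a key in the dictionary, Pandas does not raise an error; it fills that position with NaN. The sketch below illustrates this, assuming a hypothetical label "Zoe" that is not part of the original data.

```python
import pandas as pd

ages = {"Andrew": 31, "Kate": 45, "Matthew": 26, "Helen": 19}

# "Zoe" is a hypothetical label that is NOT a key in the dictionary
subset = pd.Series(ages, index=["Andrew", "Zoe"])
print(subset)

# The missing label is filled with NaN, so the dtype becomes float
print(subset.isna())
```

Because NaN is a floating-point value, the resulting Series is no longer an integer Series.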

### Creating Pandas Series from NumPy Arrays

In [13]:
import numpy as np
nparray1 = np.array([1,2,3,4])
pd.Series(nparray1)

0    1
1    2
2    3
3    4
dtype: int64

### Merging Two Series (Concat)

In [14]:
s1 = pd.Series([2,3,55,2,6,44]) 
s2 = pd.Series([42,32,34,2,1,4,42])
pd.concat([s1,s2])

0     2
1     3
2    55
3     2
4     6
5    44
0    42
1    32
2    34
3     2
4     1
5     4
6    42
dtype: int64

- pd.concat keeps the original index of each Series, so the label 0 now appears twice. Selecting label 0 therefore returns two values.

In [15]:
Merged_Series = pd.concat([s1,s2])
Merged_Series[0]

0     2
0    42
dtype: int64
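
If the duplicated labels are not wanted, one option is to let pd.concat build a fresh index with ignore_index=True. This is a minimal sketch that reuses the s1 and s2 values from above; the variable name merged is only illustrative.

```python
import pandas as pd

s1 = pd.Series([2, 3, 55, 2, 6, 44])
s2 = pd.Series([42, 32, 34, 2, 1, 4, 42])

# ignore_index=True discards the original labels and numbers the rows 0..n-1
merged = pd.concat([s1, s2], ignore_index=True)

print(merged.index)    # a single RangeIndex covering all 13 elements
print(merged.loc[0])   # label 0 is now unique, so one value is returned
```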

In [16]:
### We can use Selection and use different selectors to Select Specific Elements from the Series.

series3 = pd.Series([11,12,13,14,15,16])
series3[0:3]

0    11
1    12
2    13
dtype: int64
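
Other selectors work in a similar way. The sketch below, which reuses the series2 values from earlier, contrasts position-based slicing with iloc, label-based slicing with loc, and a boolean mask. The variable names are illustrative.

```python
import pandas as pd

series2 = pd.Series([1, 2, 4, 5, 6],
                    index=["First", "Zero", "Second", "Third", "Fourth"])

# Position-based slice: the end position (3) is excluded
by_position = series2.iloc[0:3]

# Label-based slice: both end labels are included
by_label = series2.loc["First":"Second"]

# Boolean mask: keep only the elements greater than 4
large_values = series2[series2 > 4]

print(by_position)
print(by_label)
print(large_values)
```

Note that the two slices return the same three elements here, even though one excludes its end point and the other includes it.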

### Pandas Dataframes

A Pandas DataFrame is a two-dimensional, size-mutable, heterogeneous (data of any type) tabular data structure with labeled axes (rows and columns).
- In other words, the data is aligned in a tabular fashion in rows and columns.

#### Creating a DataFrame

In [17]:
### Creating a DataFrame using a Dictionary.

names = {"Names":["Allen","Rob","Harold","Amy"],"Age":[21,11,13,15]} 

DataFrame1 = pd.DataFrame(names)
DataFrame1

    Names  Age
0   Allen   21
1     Rob   11
2  Harold   13
3     Amy   15


In [18]:
DataFrame1["Age"]           ### This is a Series --> TRY "type(DataFrame1["Age"])"

0    21
1    11
2    13
3    15
Name: Age, dtype: int64

In [19]:
### We can also Assign Column Name

var = [10,30,20,89,48,40]
DataFrame2 = pd.DataFrame(var, columns= ["Variables"])
DataFrame2

   Variables
0         10
1         30
2         20
3         89
4         48
5         40


#### We can also create DataFrames from NumPy arrays

In [20]:
array1 = np.random.randint(10, size= (5,2))
array1

array([[9, 8],
       [5, 1],
       [5, 1],
       [6, 9],
       [3, 6]])

In [21]:
### We can assign them the columns name

DataFrame3 = pd.DataFrame(array1, columns= ["Var1","Var2"])
DataFrame3

   Var1  Var2
0     9     8
1     5     1
2     5     1
3     6     9
4     3     6


#### DataFrame.axes

Returns a list representing the axes of the given DataFrame: the row axis labels first, followed by the column axis labels.

In [22]:
DataFrame3.axes

[RangeIndex(start=0, stop=5, step=1), Index(['Var1', 'Var2'], dtype='object')]

In [23]:
### To determine shape

DataFrame3.shape

(5, 2)

In [24]:
### Checking the Dimension of the DataFrame (BY DEFINITION THIS IS 2 DIMENSIONAL)

DataFrame3.ndim

2

#### Checking the total number of elements in the DataFrame

In [25]:
DataFrame3.size

10

#### Getting the Columns Names from the DataFrame

In [26]:
DataFrame3.columns

Index(['Var1', 'Var2'], dtype='object')

#### Index 
The index (row labels) of the DataFrame. For the default RangeIndex, the stop value also tells us how many rows the DataFrame has.

In [27]:
DataFrame3.index

RangeIndex(start=0, stop=5, step=1)

#### Values 
Return a Numpy representation of the given DataFrame.

In [28]:
DataFrame3.values

array([[9, 8],
       [5, 1],
       [5, 1],
       [6, 9],
       [3, 6]])

#### Accessing the rows of the DataFrame

In [29]:
DataFrame4 = pd.DataFrame({"Name":["Josh", "Rachel", "Tim", "Kate", "Zach", "Andrew"],
                    "Age":[11, 13, 16, 12, 14, 18],
                    "Salary":[10000, 23000, 18000, 3900000, 19000, 24000]})

DataFrame4

     Name  Age   Salary
0    Josh   11    10000
1  Rachel   13    23000
2     Tim   16    18000
3    Kate   12  3900000
4    Zach   14    19000
5  Andrew   18    24000


In [30]:
DataFrame4.Age

0    11
1    13
2    16
3    12
4    14
5    18
Name: Age, dtype: int64

In [31]:
DataFrame4.Age[3]

12
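
The cells above select a column and then one element of it. To pull out a whole row, iloc (integer position) and loc (index label) can be used. The following is a minimal sketch on the same data; the variable names are illustrative.

```python
import pandas as pd

DataFrame4 = pd.DataFrame({"Name": ["Josh", "Rachel", "Tim", "Kate", "Zach", "Andrew"],
                           "Age": [11, 13, 16, 12, 14, 18],
                           "Salary": [10000, 23000, 18000, 3900000, 19000, 24000]})

# Row at integer position 3, returned as a Series
row_by_position = DataFrame4.iloc[3]

# Single cell: row label 3, column "Age"
age_of_row_3 = DataFrame4.loc[3, "Age"]

# First two rows, returned as a DataFrame (end position excluded)
first_rows = DataFrame4.iloc[0:2]

print(row_by_position)
print(age_of_row_3)
print(first_rows)
```

With the default RangeIndex, position 3 and label 3 refer to the same row; they differ once a custom index is set.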

#### Assigning a Value to a Specific Row

We can select a row with ***iloc*** (by integer position) or ***loc*** (by index label) and assign new values to it. The cell below uses ***iloc*** to overwrite the row at position 2.

In [32]:
DataFrame4.iloc[2] = ["Ron", 15, 185]
DataFrame4

     Name  Age   Salary
0    Josh   11    10000
1  Rachel   13    23000
2     Ron   15      185
3    Kate   12  3900000
4    Zach   14    19000
5  Andrew   18    24000
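
Assignment through loc works along the same lines, but it addresses rows by index label and can also target a single cell. This is a minimal sketch on a small hypothetical DataFrame named df; the replacement values are made up for illustration.

```python
import pandas as pd

# Small hypothetical DataFrame, used only for this illustration
df = pd.DataFrame({"Name": ["Josh", "Rachel", "Tim"],
                   "Age": [11, 13, 16],
                   "Salary": [10000, 23000, 18000]})

# Overwrite a single cell: row label 2, column "Salary"
df.loc[2, "Salary"] = 18500

# Overwrite the whole row with label 0 (one value per column, in column order)
df.loc[0] = ["Joan", 12, 9000]

print(df)
```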


#### Assigning Custom Indexes

In [33]:
### First we need a list of values for the column that will become the custom index.

roll_no = [112890,39080,18878,38788,9070,50830]

In [34]:
### Adding the Roll Number Column in the DataFrame

DataFrame4["Roll Number"] = roll_no
DataFrame4


     Name  Age   Salary  Roll Number
0    Josh   11    10000       112890
1  Rachel   13    23000        39080
2     Ron   15      185        18878
3    Kate   12  3900000        38788
4    Zach   14    19000         9070
5  Andrew   18    24000        50830


In [35]:
### Now Setting the Index on the Basis of Roll Number

DataFrame4.set_index("Roll Number",inplace = True)
DataFrame4

               Name  Age   Salary
Roll Number
112890         Josh   11    10000
39080        Rachel   13    23000
18878           Ron   15      185
38788          Kate   12  3900000
9070           Zach   14    19000
50830        Andrew   18    24000


> Now, to locate a record (row) by its roll number, we can use the loc[] indexer.

In [36]:
DataFrame4.loc[9070]

Name       Zach
Age          14
Salary    19000
Name: 9070, dtype: object
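
The loc[] indexer also accepts a list of labels or a label plus a column name, and reset_index() turns the custom index back into an ordinary column. The sketch below rebuilds a reduced version of the table (without the Salary column) so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Josh", "Rachel", "Ron", "Kate", "Zach", "Andrew"],
                   "Age": [11, 13, 15, 12, 14, 18],
                   "Roll Number": [112890, 39080, 18878, 38788, 9070, 50830]})
df = df.set_index("Roll Number")

# Several rows at once: pass a list of index labels
two_rows = df.loc[[9070, 50830]]

# One cell: index label plus column name
name_of_9070 = df.loc[9070, "Name"]

# Move "Roll Number" back into a regular column and restore a 0..n-1 index
restored = df.reset_index()

print(two_rows)
print(name_of_9070)
print(restored.columns)
```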

### Sorting Indexes

In [37]:
DataFrame5 = pd.DataFrame({"Name":["Josh", "Rachel", "Tim", "Kate", "Zach", "Andrew"],
                        "Age":[11, 13, 16, 12, 14, 18],
                        "Salary":[10000, 23000, 18000, 3900000, 19000, 24000]}, index= [1, 89, 39, 36, 78, 54])

DataFrame5.sort_index(inplace=True)
DataFrame5

      Name  Age   Salary
1     Josh   11    10000
36    Kate   12  3900000
39     Tim   16    18000
54  Andrew   18    24000
78    Zach   14    19000
89  Rachel   13    23000
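
Sorting can also run in descending order, or by the values of a column instead of the index. The following sketch uses the same data as DataFrame5. Without inplace=True, both calls return a new DataFrame and leave the original untouched.

```python
import pandas as pd

DataFrame5 = pd.DataFrame({"Name": ["Josh", "Rachel", "Tim", "Kate", "Zach", "Andrew"],
                           "Age": [11, 13, 16, 12, 14, 18],
                           "Salary": [10000, 23000, 18000, 3900000, 19000, 24000]},
                          index=[1, 89, 39, 36, 78, 54])

# Sort by index labels, largest first
descending = DataFrame5.sort_index(ascending=False)

# Sort by the values of the "Age" column, smallest first
by_age = DataFrame5.sort_values("Age")

print(descending)
print(by_age)
```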


### Filtering in DataFrame

In [38]:
DataFrame6 = pd.DataFrame({"Name":["Josh", "Mike", "Julia", "Sergio"],
                          "Department":["IT", "Human Resources", "Finance", "Supply Chain"],
                          "Income":[4800, 5200, 6600, 5700], 
                          "Age":[24, 28, 33, 41]})
DataFrame6

     Name       Department  Income  Age
0    Josh               IT    4800   24
1    Mike  Human Resources    5200   28
2   Julia          Finance    6600   33
3  Sergio     Supply Chain    5700   41


In [39]:
### To check which rows belong to a specific Department

DataFrame6["Department"] == "IT"

0     True
1    False
2    False
3    False
Name: Department, dtype: bool

> We can also pass the condition to the loc[] indexer, which lets us filter the rows by Department and pick the columns to return at the same time.

In [40]:
DataFrame6.loc[DataFrame6["Department"] == "IT", "Name"]

0    Josh
Name: Name, dtype: object
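
To match several departments at once, one common approach is to build the condition with Series.isin(). This is a minimal sketch on the same DataFrame6 data; the chosen departments and the variable name selected are illustrative.

```python
import pandas as pd

DataFrame6 = pd.DataFrame({"Name": ["Josh", "Mike", "Julia", "Sergio"],
                           "Department": ["IT", "Human Resources", "Finance", "Supply Chain"],
                           "Income": [4800, 5200, 6600, 5700],
                           "Age": [24, 28, 33, 41]})

# True for rows whose Department is one of the listed values
in_departments = DataFrame6["Department"].isin(["IT", "Finance"])

# Row filter plus column selection in one loc[] call
selected = DataFrame6.loc[in_departments, ["Name", "Income"]]
print(selected)
```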

> We can also filter the employees by a numeric condition, for example an Income above 5500.

In [41]:
DataFrame6[DataFrame6["Income"] > 5500]

     Name    Department  Income  Age
2   Julia       Finance    6600   33
3  Sergio  Supply Chain    5700   41


In [42]:
DataFrame6[(DataFrame6["Age"]>30) | (DataFrame6["Department"] == "Human Resources")]

     Name       Department  Income  Age
1    Mike  Human Resources    5200   28
2   Julia          Finance    6600   33
3  Sergio     Supply Chain    5700   41
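
The cell above combines two conditions with | (OR). Conditions can be combined with & (AND) in the same way, and each condition needs its own parentheses. The sketch below shows this on the same data, with DataFrame.query() as an alternative spelling; the thresholds are illustrative.

```python
import pandas as pd

DataFrame6 = pd.DataFrame({"Name": ["Josh", "Mike", "Julia", "Sergio"],
                           "Department": ["IT", "Human Resources", "Finance", "Supply Chain"],
                           "Income": [4800, 5200, 6600, 5700],
                           "Age": [24, 28, 33, 41]})

# Both conditions must hold: older than 30 AND income above 6000
both = DataFrame6[(DataFrame6["Age"] > 30) & (DataFrame6["Income"] > 6000)]

# The same filter written as a query string
via_query = DataFrame6.query("Age > 30 and Income > 6000")

print(both)
```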


> To invert a filter, use the ~ (tilde) operator.

In [43]:
DataFrame6[~(DataFrame6["Age"]<35)]

     Name    Department  Income  Age
3  Sergio  Supply Chain    5700   41


#### Filtering with the filter() Function

- **REFER:** https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html

In [44]:
DataFrame6.filter(items=["Department","Name","Income"])

        Department    Name  Income
0               IT    Josh    4800
1  Human Resources    Mike    5200
2          Finance   Julia    6600
3     Supply Chain  Sergio    5700


#### Adding Rows — append()

In [45]:
DataFrame6.dtypes

Name          object
Department    object
Income         int64
Age            int64
dtype: object

In [46]:
DataFrame6.append({"Name": "Romeo"}, ignore_index= True)

     Name       Department  Income   Age
0    Josh               IT  4800.0  24.0
1    Mike  Human Resources  5200.0  28.0
2   Julia          Finance  6600.0  33.0
3  Sergio     Supply Chain  5700.0  41.0
4   Romeo              NaN     NaN   NaN


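- The new row is added at the end of the DataFrame. Any column we do not supply is filled with NaN, which also converts the integer columns Income and Age to float, so we should provide all the values.
- Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; on those versions pd.concat() does the same job.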

In [47]:
DataFrame6.append({"Name":"Romeo","Age":26,"Department":"IT","Income":5500},ignore_index=True)

     Name       Department  Income  Age
0    Josh               IT    4800   24
1    Mike  Human Resources    5200   28
2   Julia          Finance    6600   33
3  Sergio     Supply Chain    5700   41
4   Romeo               IT    5500   26
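
On pandas 2.0 and later, append() is no longer available on DataFrames, so the cells above raise an AttributeError there. A minimal sketch of the same row addition with pd.concat() follows; the names new_row and extended are illustrative.

```python
import pandas as pd

DataFrame6 = pd.DataFrame({"Name": ["Josh", "Mike", "Julia", "Sergio"],
                           "Department": ["IT", "Human Resources", "Finance", "Supply Chain"],
                           "Income": [4800, 5200, 6600, 5700],
                           "Age": [24, 28, 33, 41]})

# Wrap the new record in a one-row DataFrame
new_row = pd.DataFrame([{"Name": "Romeo", "Department": "IT", "Income": 5500, "Age": 26}])

# Stack the two frames; ignore_index=True renumbers the rows 0..n-1
extended = pd.concat([DataFrame6, new_row], ignore_index=True)
print(extended)
```

As with append(), the original DataFrame6 is not modified; the result has to be assigned to a name to be kept.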


#### Removing Rows

In [48]:
DataFrame6.drop(DataFrame6[DataFrame6["Age"]>30].index)

   Name       Department  Income  Age
0  Josh               IT    4800   24
1  Mike  Human Resources    5200   28
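
drop() can also take index labels directly, or column names through the columns argument. The sketch below shows both on the same data. As in the cell above, drop() returns a new DataFrame and leaves DataFrame6 unchanged unless the result is assigned back.

```python
import pandas as pd

DataFrame6 = pd.DataFrame({"Name": ["Josh", "Mike", "Julia", "Sergio"],
                           "Department": ["IT", "Human Resources", "Finance", "Supply Chain"],
                           "Income": [4800, 5200, 6600, 5700],
                           "Age": [24, 28, 33, 41]})

# Remove the rows with index labels 0 and 1
remaining = DataFrame6.drop(index=[0, 1])

# Remove a column instead of rows
without_age = DataFrame6.drop(columns=["Age"])

print(remaining)
print(without_age)
```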
