In [18]:
import pandas as pd
import numpy as np

# Introduction

## 1. Pandas Series
- A Pandas Series is a one-dimensional labeled array-like object that can hold data of any type.
- A Pandas Series can be thought of as a column in a spreadsheet or a single column of a DataFrame. It consists of two main components: the labels and the data.

### Creating a Pandas Series

In [2]:
data = [1, 2, 3, 4, 5]
my_series = pd.Series(data)
print(my_series)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In this example, we created a Python list called `data` containing five integer values. We then passed this list to the `pd.Series()` constructor, which converted it into a Pandas Series called `my_series`.

Here, `dtype: int64` denotes that the Series stores its values as 64-bit integers (`int64`). The left-hand column of the output (0 to 4) shows the labels, and the right-hand column shows the data.
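
The two components of a Series, the labels and the data, can also be inspected directly. The following is a minimal sketch; the labels `a` to `c` and the name `labeled_series` are illustrative and not part of the original example.

```python
import pandas as pd

# Build a Series with explicit string labels instead of the default 0..n-1
labeled_series = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(labeled_series.index.tolist())  # the labels
print(labeled_series.to_numpy())      # the underlying data
print(labeled_series["b"])            # look up one value by its label
```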

## 2. Pandas DataFrame

#### I. Pandas DataFrame Using Python Dictionary

In [3]:
data = {
    "Name": ["Prabin", "John", "Sushma"],
    "Age":[25, 30, 35],
    "City": ["New York", "London", "Paris"]}
df = pd.DataFrame(data)
print(df)

     Name  Age      City
0  Prabin   25  New York
1    John   30    London
2  Sushma   35     Paris


In this example, we created a dictionary called `data` that contains the column names (`Name`, `Age`, `City`) as keys, and lists of values as their respective values.

We then used the `pd.DataFrame()` function to convert the dictionary into a DataFrame called `df`.

<br>

#### II. Pandas DataFrame Using Python List

In [4]:

data = [["Prabin", 25, "NY"],
        ["John", 22, "London"],
        ["Bob", 33, "Paris"]]

df = pd.DataFrame(data, columns=["Name", "Age", "City"])
print(df)

     Name  Age    City
0  Prabin   25      NY
1    John   22  London
2     Bob   33   Paris


<br>

#### III. Pandas DataFrame From a File
Another common way to create a DataFrame is by loading data from a CSV (Comma-Separated Values) file. For example,

In [5]:
df = pd.read_csv("C:\\Users\\hp\\Downloads\\archive\\train.csv")
print(df.head()) 

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In this example, we used the `read_csv()` function which reads the CSV file `train.csv`, and automatically creates a DataFrame object `df`, containing data from the CSV file.

**Note**: We can also create a DataFrame from other sources such as JSON files, Excel spreadsheets, and SQL databases. The corresponding functions are:
- **JSON**: `read_json()`
- **Excel spreadsheet**: `read_excel()`
- **SQL**: `read_sql()`
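
As a rough illustration of how these readers behave, the sketch below parses small in-memory strings instead of files on disk. The strings `csv_text` and `json_text` are made-up stand-ins, and `io.StringIO` is used only so the example runs without any external file.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk
csv_text = "Name,Age\nPrabin,25\nJohn,30\n"
json_text = '[{"Name": "Prabin", "Age": 25}, {"Name": "John", "Age": 30}]'

# read_csv() and read_json() accept file-like objects as well as paths
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text), orient="records")

print(csv_df)
print(json_df)
```

Both calls should produce a DataFrame with the columns `Name` and `Age`.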

<br><br>

## 3. Pandas Index
In Pandas, an index refers to the labeled array that identifies rows or columns in a DataFrame or a Series. For example,

In [6]:
data = {
    "Name": ["Prabin", "John", "Sushma"],
    "Age":[25, 30, 35],
    "City": ["New York", "London", "Paris"]}
df = pd.DataFrame(data)
print(df)

     Name  Age      City
0  Prabin   25  New York
1    John   30    London
2  Sushma   35     Paris


In the above `DataFrame`, the numbers **0**, **1**, and **2** represent the index, providing unique labels to each row.

We can use indexes to uniquely identify rows and to access data efficiently and precisely.
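
To make this concrete, here is a small sketch of looking up a row by its index label. It uses the `.loc` property, which is covered later in this document; the name `df_people` is chosen only for this illustration.

```python
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["Prabin", "John", "Sushma"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Paris"],
})

# Fetch the row whose index label is 1 (returned as a Series)
row = df_people.loc[1]
print(row["Name"], row["City"])
```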

### a. Create Indexes in Pandas
Pandas offers several ways to create indexes. Some common methods are as follows:
- Default Index
- Setting Index
- Creating a Range Index

#### I. Default Index
When we create a `DataFrame` or `Series` without specifying an index explicitly, Pandas assigns a default integer index starting from 0. For example,

In [7]:
data = {
    "Name": ["Prabin", "John", "Sushma"],
    "Age":[25, 30, 35],
    "City": ["New York", "London", "Paris"]}
df = pd.DataFrame(data)
print(df)

     Name  Age      City
0  Prabin   25  New York
1    John   30    London
2  Sushma   35     Paris


In this example, the default index `[0, 1, 2]` is automatically assigned to the rows.

<br>

#### II. Setting Index
We can set an existing column as the index using the `set_index()` method. For example,

In [8]:
data = {
    "Name": ["Prabin", "John", "Sushma"],
    "Age":[25, 30, 35],
    "City": ["New York", "London", "Paris"]}
df = pd.DataFrame(data)
# print(df.set_index("Name"))  # would print the re-indexed result without modifying df
df.set_index("Name", inplace=True)
print(df)

        Age      City
Name                 
Prabin   25  New York
John     30    London
Sushma   35     Paris


In this example, the `Name` column is set as the index, replacing the default integer index.

Here, the `inplace=True` parameter performs the operation directly on the object itself, without creating a new object. When we specify `inplace=True`, the original object is modified, and the changes are directly applied.
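
For comparison, the sketch below calls `set_index()` without `inplace=True`. In that case the method returns a new DataFrame and leaves the original untouched. The names `people` and `indexed` are illustrative only.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Prabin", "John", "Sushma"],
    "Age": [25, 30, 35],
})

# Without inplace=True, set_index() returns a new object
indexed = people.set_index("Name")

print(people.columns.tolist())   # the original still has a 'Name' column
print(indexed.index.tolist())    # the copy uses the names as its index
```

Assigning the result to a new variable is often easier to follow than modifying an object in place, since the original stays available.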

<br>

#### III. Creating a Range Index
We can create a range index with specific start and stop values using `pd.RangeIndex()`. As with Python's `range()`, the stop value is not included. For example,

In [9]:
data = {
    "Name": ["Prabin", "John", "Sushma"],
    "Age":[25, 30, 35],
    "City": ["New York", "London", "Paris"]}
# Create the DataFrame with a range index running from 1 to 3.
df = pd.DataFrame(data, index=pd.RangeIndex(1, 4, name="Index"))
print(df)

         Name  Age      City
Index                       
1      Prabin   25  New York
2        John   30    London
3      Sushma   35     Paris


In [10]:
# Create a Range Index 5, 6, 7 instead of 1, 2, 3:
df = pd.DataFrame(data, index=pd.RangeIndex(5, 8, name="Index"))
print(df)

         Name  Age      City
Index                       
5      Prabin   25  New York
6        John   30    London
7      Sushma   35     Paris


Here, a range index from 5 up to but not including 8 is created with the name `Index`.
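
`pd.RangeIndex()` also accepts a `step` argument. The following sketch, with illustrative values not taken from the examples above, builds the labels 10, 20, and 30.

```python
import pandas as pd

data_people = {
    "Name": ["Prabin", "John", "Sushma"],
    "Age": [25, 30, 35],
}

# start=10, stop=40 (excluded), step=10 gives the labels 10, 20, 30
stepped = pd.DataFrame(
    data_people,
    index=pd.RangeIndex(start=10, stop=40, step=10, name="Index"),
)
print(stepped)
```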

<br>

### b. Modifying Indexes in Pandas
Pandas allows us to make changes to indexes easily. Some common modification operations are:
- Renaming Index
- Resetting Index

#### I. Renaming Index
We can rename an index using the `rename()` method. For example,

In [11]:
# Create a DataFrame
data={
    'Name': ["John", "Alice", "Sushma"],
    'Age': [25, 26, 27],
    'City': ["NY", "London", "Paris"]
}
df = pd.DataFrame(data)

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Renaming Index
df.rename(index={0:'A', 1:'B', 2:'C'}, inplace=True)

# Display dataframe after index is renamed
print("\nModified DataFrame")
print(df)

Original DataFrame:
     Name  Age    City
0    John   25      NY
1   Alice   26  London
2  Sushma   27   Paris

Modified DataFrame
     Name  Age    City
A    John   25      NY
B   Alice   26  London
C  Sushma   27   Paris


In this example, we renamed the indexes **0**, **1**, and **2** to `'A'`, `'B'`, and `'C'` respectively.

<br>

#### II. Resetting Index
We can reset the index to the default integer index using the `reset_index()` method. By default, the old index labels are kept as a new column named `index`. For example,

In [12]:
data={
    'Name': ["John", "Alice", "Sushma"],
    'Age': [25, 26, 27],
    'City': ["NY", "London", "Paris"]
}
df = pd.DataFrame(data)

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Renaming Index
df.rename(index={0:'A', 1:'B', 2:'C'}, inplace=True)

# Display dataframe after index is renamed
print("\nModified DataFrame")
print(df)

# Reset to the default integer index (the old labels move into an 'index' column)
df.reset_index(inplace=True)

# Display DataFrame after index is reset
print("\nReturn to Original DataFrame")
print(df)

Original DataFrame:
     Name  Age    City
0    John   25      NY
1   Alice   26  London
2  Sushma   27   Paris

Modified DataFrame
     Name  Age    City
A    John   25      NY
B   Alice   26  London
C  Sushma   27   Paris

Return to Original DataFrame
  index    Name  Age    City
0     A    John   25      NY
1     B   Alice   26  London
2     C  Sushma   27   Paris
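
Notice that the old labels survive as an `index` column in the output above. If they are not needed, passing `drop=True` discards them. This is a minimal sketch with an illustrative DataFrame named `df_letters`.

```python
import pandas as pd

df_letters = pd.DataFrame(
    {"Name": ["John", "Alice", "Sushma"], "Age": [25, 26, 27]},
    index=["A", "B", "C"],
)

kept = df_letters.reset_index()              # old labels become an 'index' column
dropped = df_letters.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
```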


<br><br>

<br><br><br>

# DataFrame Operations and Manipulations

## 1. Pandas DataFrame Analysis
Pandas DataFrame objects come with a variety of built-in methods, such as `head()`, `tail()`, and `info()`, that allow us to view and analyze DataFrames.


In [13]:
data = pd.read_csv("C:\\Users\\hp\\Downloads\\archive\\train.csv")
print(data.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


### a. Pandas head()

The `head()` method gives a quick preview of a DataFrame. It returns the column headers and a specified number of rows from the beginning (five by default). For example,

In [14]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


<br>

### b. Pandas tail()
The `tail()` method is similar to `head()`, but it returns rows from the end of the DataFrame (the last five by default). For example,

In [15]:
data.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
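
Both methods take an optional number of rows. The sketch below uses a small made-up DataFrame named `numbers` so that it runs without the CSV file.

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # first 3 rows instead of the default 5
last_two = numbers.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```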


<br>

### c. Get DataFrame Information -> info()
The `info()` method gives us overall information about the DataFrame, such as its class, index range, column data types, non-null counts, and memory usage.

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


<br><br>

## 2. Pandas DataFrame Manipulation
DataFrame manipulation in Pandas involves editing and modifying existing DataFrames. Some common DataFrame manipulation operations are:
- Adding rows/columns
- Removing rows/columns
- Renaming rows/columns

#### a. Adding rows/columns
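We can add a new column to an existing Pandas DataFrame by assigning a list or array of values to a new column name. The number of values must match the number of rows. We can add a new row by assigning a list of values to a new index label with `.loc`. For example,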

In [38]:
data.shape

(891, 12)

In [40]:
# Adding columns
data['SpecialAddition'] = np.arange(1, 892)
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'SpecialAddition'],
      dtype='object')

In [41]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SpecialAddition
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,4
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,5


In [42]:
# Adding a row under the new index label 894
data.loc[894] = [895, 1, 3, "Thapa, Mr. Prabin", "male", 25, 0, 0, "CC456", 8.90, np.nan, "S", 894]

In [43]:
data.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SpecialAddition
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S,888
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S,889
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C,890
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q,891
894,895,1,3,"Thapa, Mr. Prabin",male,25.0,0,0,CC456,8.9,,S,894
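
The same two operations can be sketched on a small DataFrame that does not depend on the CSV file. The name `small` and its values are illustrative only.

```python
import pandas as pd

small = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Add a column: the list length must equal the number of rows
small["City"] = ["New York", "London"]

# Add a row: assign one value per column to a label that does not exist yet
small.loc[len(small)] = ["Charlie", 35, "Paris"]

print(small)
```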


<br>

#### b. Removing rows/columns

In [44]:
# Removing columns
data.drop("SpecialAddition", axis=1, inplace=True)
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [45]:
# Removing rows
data.drop(894, axis=0, inplace=True)
data.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q

In [46]:
data.shape

(891, 12)

In [47]:
# Removing several columns, one at a time and as a list

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'London', 'Paris', 'Tokyo'],
        'Height': ['165', '178', '185', '171'],
        'Profession': ['Engineer', 'Entrepreneur', 'Unemployed', 'Actor'],
        'Marital Status': ['Single', 'Married', 'Divorced', 'Engaged']}
df = pd.DataFrame(data)

# display the original DataFrame
print("Original DataFrame:")
print(df)
print()

# delete age column
df.drop('Age', axis=1, inplace=True)

# delete marital status column
df.drop(columns='Marital Status', inplace=True)

# delete height and profession columns
df.drop(['Height', 'Profession'], axis=1, inplace=True)

# display the modified DataFrame after deleting columns
print("Modified DataFrame:")
print(df)

Original DataFrame:
      Name  Age      City Height    Profession Marital Status
0    Alice   25  New York    165      Engineer         Single
1      Bob   30    London    178  Entrepreneur        Married
2  Charlie   35     Paris    185    Unemployed       Divorced
3    David   40     Tokyo    171         Actor        Engaged

Modified DataFrame:
      Name      City
0    Alice  New York
1      Bob    London
2  Charlie     Paris
3    David     Tokyo
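
The example above removes only columns. Rows can be removed from a small DataFrame in the same way by passing index labels. This is a minimal sketch; the variable names are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "City": ["New York", "London", "Paris", "Tokyo"],
})

# Drop a single row by its index label (axis=0 refers to rows)
without_bob = people.drop(1, axis=0)

# Drop several rows at once using the index keyword
middle_only = people.drop(index=[0, 3])

print(without_bob)
print(middle_only)
```

Neither call uses `inplace=True`, so `people` still holds all four rows afterwards.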


<br>

#### c. Renaming Rows/Columns

##### Renaming Columns

In [52]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'London', 'Paris', 'Tokyo'],
        'Height': ['165', '178', '185', '171'],
        'Profession': ['Engineer', 'Entrepreneur', 'Unemployed', 'Actor'],
        'Marital Status': ['Single', 'Married', 'Divorced', 'Engaged']}
df = pd.DataFrame(data)

# display the original DataFrame
print("Original DataFrame:")
print(df)
print()

# Rename column 'Name' to 'First_Name':
df.rename(columns={'Name':'First_Name'}, inplace=True)

# Rename columns 'Age' and 'Profession'
df.rename(mapper={'Age':'Old', 'Profession':'Career'}, axis=1, inplace=True)
print("Modified DataFrame:")
print(df)

Original DataFrame:
      Name  Age      City Height    Profession Marital Status
0    Alice   25  New York    165      Engineer         Single
1      Bob   30    London    178  Entrepreneur        Married
2  Charlie   35     Paris    185    Unemployed       Divorced
3    David   40     Tokyo    171         Actor        Engaged

Modified DataFrame:
  First_Name  Old      City Height        Career Marital Status
0      Alice   25  New York    165      Engineer         Single
1        Bob   30    London    178  Entrepreneur        Married
2    Charlie   35     Paris    185    Unemployed       Divorced
3      David   40     Tokyo    171         Actor        Engaged


In this example, we renamed a single column using the `columns={'Name': 'First_Name'}` parameter. We also renamed multiple columns with the `mapper={'Age': 'Old', 'Profession': 'Career'}` argument.

- `axis=1`: indicates that columns are to be renamed
- `inplace=True`: indicates that the changes are to be made in the original DataFrame

##### Renaming rows

In [56]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# display the original DataFrame
print("Original DataFrame:")
print(df)
print()

# rename a single row (index label 0 becomes 7)
df.rename(index={0: 7}, inplace=True) 

# rename multiple rows/index labels
df.rename(mapper={1:8, 2:9, 3:10}, axis=0, inplace=True)

print("Modified DataFrame:")
print(df)

Original DataFrame:
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    David   40     Tokyo

Modified DataFrame:
       Name  Age      City
7     Alice   25  New York
8       Bob   30    London
9   Charlie   35     Paris
10    David   40     Tokyo


In this example, we renamed a single row using the `index={0: 7}` parameter. We also renamed multiple rows with the `mapper={1: 8, 2: 9, 3: 10}` argument.

- `axis=0`: indicates that rows are to be renamed
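
Besides dictionaries, `rename()` also accepts a function that is applied to every label. The sketch below is an illustration of that option and is not part of the original examples; the variable names are made up.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
})

# Apply str.upper to every column label
upper_cols = people.rename(columns=str.upper)

# Apply a lambda to every row label (0..3 becomes 7..10)
shifted = people.rename(index=lambda label: label + 7)

print(upper_cols.columns.tolist())
print(shifted.index.tolist())
```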

<br><br>

# Pandas Indexing and Slicing

In Pandas, indexing refers to accessing rows and columns of data from a DataFrame, whereas slicing refers to accessing a range of rows and columns.

We can access a single value or a range of data from a DataFrame using different methods.

### Access Columns of a DataFrame

In [61]:
df = pd.read_csv("C:\\Users\\hp\\Downloads\\archive\\train.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [68]:
df['Sex'].head()  # Access a single column

0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object

In [69]:
df[['Sex', 'Age']].tail()  # Access multiple columns

Unnamed: 0,Sex,Age
886,male,27.0
887,female,19.0
888,female,
889,male,26.0
890,male,32.0


In these examples, we accessed the `Sex` column on its own and then the `Sex` and `Age` columns together using the `[]` operator. Selecting a single column returns a Series, while selecting a list of columns returns a DataFrame.

The `[]` operator, however, provides limited functionality. Even basic operations like selecting rows, slicing DataFrames, and selecting individual elements are awkward with the `[]` operator alone.

So we use properties such as `.loc` for indexing and slicing DataFrames. They provide much more flexibility than the `[]` operator.



### Pandas.loc
In Pandas, we use the `.loc` property to access and modify data within a DataFrame using **label-based** indexing. It allows us to select specific rows and columns based on their labels.

Basic Syntax
```Python
df.loc[row_indexer, column_indexer]
```
Here, 
- `row_indexer` - selects rows by their labels; it can be a single label, a list of labels, a slice of labels, or a boolean array
- `column_indexer` - selects columns by their labels; it accepts the same kinds of values

#### Indexing with `.loc[]`

In [73]:
df.loc[[0, 3, 4], ["Name", "Embarked"]] # Indexing 

Unnamed: 0,Name,Embarked
0,"Braund, Mr. Owen Harris",S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",S
4,"Allen, Mr. William Henry",S


#### Slicing with `.loc[]`

In [74]:
df.loc[1:5, "Name":"Sex"] # Slicing

Unnamed: 0,Name,Sex
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,"Heikkinen, Miss. Laina",female
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
4,"Allen, Mr. William Henry",male
5,"Moran, Mr. James",male
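
Note that the row labelled 5 and the `Sex` column both appear in the result: unlike ordinary Python slicing, a `.loc` slice includes its end label. The sketch below shows the difference on a small made-up DataFrame named `letters`.

```python
import pandas as pd

letters = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# Label-based slice: both "b" and "d" are included
sliced = letters.loc["b":"d"]

# Ordinary Python slicing on a list excludes the stop position
plain = [10, 20, 30, 40][1:3]

print(sliced.index.tolist())
print(plain)
```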


#### Boolean Indexing with `.loc[]`

In [78]:
df.loc[df['Age']>70]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S


<br><br>

# Pandas Select

- Selecting data means retrieving specific rows and columns, typically through indexing and slicing.
- For large datasets, we mostly use `.loc[]`, `.at[]`, and `.iat[]` for indexing and slicing.
- We have already looked at indexing and slicing in the previous section.
- In this chapter, we will look at the `df.query()` method of selecting data.

#### `df.query()` to Select Data