## Table of Contents
1. [Append and concatenate](#app) 
    - [Append](#append)
    - [Concatenate](#concat)
2. [Joining Tables (Inner, Outer, Left, and Right)](#tjoins)
    - [Inner Joins](#inner)
    - [Outer Joins](#outer)
    - [Left Joins](#left)
    - [Right Joins](#right)
3. [Handling Dates, Timezones, Unix Timestamps](#time)
    - [Handling Dates and Times](#htd)
    - [Handling Timezones](#Tzone)
    - [Handling Unix Timestamps](#unix)
    - [Timezone-aware Operations](#aware)
4. [Introduction to Rolling Operations](#roll_intro)
    - [Rolling Sum](#rsum)
    - [Rolling Mean](#rmean)
    - [Rolling Rank](#rrank)

# 1.Append and concatenate<a class = 'anchor' id = 'app'></a>
## Append <a class = 'anchor' id = 'append'></a>
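The `append()` method in Pandas appends the rows of one DataFrame to another. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell below only runs on older versions; `pd.concat()`, covered in the next section, is the replacement.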
![image.png](attachment:e285f282-2ef2-4fa4-ac93-16514f62e5af.png)

In [1]:
import pandas as pd

# Create the two dataframes df1 and df2 

df1 = pd.DataFrame({"name" : ["Markus", "Edward"], 
                    "sales" : [34000,42000]} )

df2 = pd.DataFrame({"name" : ["Emma", "Thomas"], 
                    "sales" : [52000,72000]})

# display first dataframe
df1
                    

Unnamed: 0,name,sales
0,Markus,34000
1,Edward,42000


In [2]:
# second data frame
df2

Unnamed: 0,name,sales
0,Emma,52000
1,Thomas,72000


In [3]:
# now let's join the dataframes
# Append the DataFrames vertically
appended_df = df1.append(df2, ignore_index=True)   # ignore_index=True discards the original row labels and renumbers the rows 0..n-1

appended_df



Unnamed: 0,name,sales
0,Markus,34000
1,Edward,42000
2,Emma,52000
3,Thomas,72000


Here, we created two DataFrames `df1` and `df2`. We then append `df2` to `df1` using the `append()` function, which **adds** the **rows of df2 to the bottom of df1**.
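
Because `append()` is no longer available in pandas 2.0 and later, the same result can be obtained with `pd.concat()`. The following is a minimal sketch that rebuilds `df1` and `df2` from the cells above so that it runs on its own:

```python
import pandas as pd

# Same two frames as in the cells above
df1 = pd.DataFrame({"name": ["Markus", "Edward"], "sales": [34000, 42000]})
df2 = pd.DataFrame({"name": ["Emma", "Thomas"], "sales": [52000, 72000]})

# Stack the rows of df2 under df1; ignore_index=True renumbers the rows 0..3
appended_df = pd.concat([df1, df2], ignore_index=True)
print(appended_df)
```

The list passed to `pd.concat()` can hold any number of frames, which is usually faster than appending them one at a time in a loop.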

## Concatenate <a class = 'anchor' id = 'concat'></a>
The **`concat()`** function in Pandas is used to concatenate two or more DataFrames together.

In [4]:
# Concatenate the DataFrames vertically
concatenated_df = pd.concat([df1, df2], axis = 0)
concatenated_df

Unnamed: 0,name,sales
0,Markus,34000
1,Edward,42000
0,Emma,52000
1,Thomas,72000
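
The vertical result above keeps the original row labels of each frame, so the index reads `0, 1, 0, 1`. As a sketch of two common ways to handle this, `ignore_index=True` renumbers the rows, while `keys` labels each block. The labels `team_a` and `team_b` are hypothetical names chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Markus", "Edward"], "sales": [34000, 42000]})
df2 = pd.DataFrame({"name": ["Emma", "Thomas"], "sales": [52000, 72000]})

# Default behaviour: row labels are kept, so they repeat (0, 1, 0, 1)
stacked = pd.concat([df1, df2], axis=0)

# Option 1: throw the old labels away and renumber from 0
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)

# Option 2: keep the old labels but add an outer level naming the source frame
labelled = pd.concat([df1, df2], keys=["team_a", "team_b"])

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(labelled.loc["team_b"])  # only the rows that came from df2
```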


In [5]:
# Concatenate the DataFrames horizontally
concatenated_df = pd.concat([df1, df2], axis = 1)
concatenated_df

Unnamed: 0,name,sales,name.1,sales.1
0,Markus,34000,Emma,52000
1,Edward,42000,Thomas,72000


# 2.Joining Tables (Inner, Outer, Left, and Right) <a class = 'anchor' id = 'tjoins'></a>
- Joining tables is a fundamental operation in data analysis and database management.
- It allows us to combine data from multiple sources to gain comprehensive insights and make informed decisions.
- This is particularly relevant in industries such as finance, e-commerce, healthcare, and government, where data integration and analysis are crucial.
![Image](https://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png?ezimgfmt=ng:webp/ngcb1)

Employee Data

![image.png](attachment:52a88d4a-d86b-4988-abfe-a79ef659e6ec.png)

Department Data

![image.png](attachment:0e73ae5f-5bc0-4b5f-8946-8d0833c7c740.png)

In [6]:
# Create the Employees and Departments DataFrames
employees = pd.read_csv('dataset/Employees.csv',sep = '|')

departments = pd.read_csv('dataset/Departmeents.csv',sep = '|')


In [7]:
employees

Unnamed: 0,EmployeeID,EmployeeName,DepartmentID
0,1,John,1
1,2,Sarah,2
2,3,Rahul,2
3,4,Priya,3
4,5,Amit,1


In [8]:
departments

Unnamed: 0,DepartmentID,DepartmentName
0,1,Finance
1,2,Sales
2,3,HR
3,4,Marketing
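
The CSV files themselves are not shown here. As a sketch, assuming they contain exactly the rows printed above, the two tables can be rebuilt inline so that the join examples run without the `dataset/` folder:

```python
import pandas as pd

# Rebuilt from the printed output above (assumption: the files hold nothing else)
employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "EmployeeName": ["John", "Sarah", "Rahul", "Priya", "Amit"],
    "DepartmentID": [1, 2, 2, 3, 1],
})

departments = pd.DataFrame({
    "DepartmentID": [1, 2, 3, 4],
    "DepartmentName": ["Finance", "Sales", "HR", "Marketing"],
})

print(employees)
print(departments)
```

Note that department 4 (Marketing) has no employees, which is what makes the four join types below give different results.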


### Inner Join <a class = 'anchor' id = inner ></a>
- An inner join returns only the matching records from both tables based on a common key.
- In Pandas, we can perform an inner join using the `merge()` function with the `how='inner'` parameter.


In [9]:
# Perform the Inner Join
inner_join = pd.merge(employees, departments, on='DepartmentID', how='inner')
inner_join

Unnamed: 0,EmployeeID,EmployeeName,DepartmentID,DepartmentName
0,1,John,1,Finance
1,5,Amit,1,Finance
2,2,Sarah,2,Sales
3,3,Rahul,2,Sales
4,4,Priya,3,HR


### Outer Join <a class = 'anchor' id = outer></a>
- An outer join returns all the records from both tables, including unmatched records. For the unmatched records, it fills the missing values with `NaN`.
- In Pandas, we can perform an outer join using the `merge()` function with the `how='outer'` parameter.

In [10]:
# Perform the Outer Join
outer_join = pd.merge(employees, departments, on='DepartmentID', how='outer')
outer_join

Unnamed: 0,EmployeeID,EmployeeName,DepartmentID,DepartmentName
0,1.0,John,1,Finance
1,5.0,Amit,1,Finance
2,2.0,Sarah,2,Sales
3,3.0,Rahul,2,Sales
4,4.0,Priya,3,HR
5,,,4,Marketing
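
To see where each row of an outer join came from, `merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds the two tables inline (assuming they match the printed output). It also shows one way to keep `EmployeeID` as an integer: the `NaN` in the unmatched row forces the column to float, and the nullable `Int64` dtype avoids that.

```python
import pandas as pd

employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "EmployeeName": ["John", "Sarah", "Rahul", "Priya", "Amit"],
    "DepartmentID": [1, 2, 2, 3, 1],
})
departments = pd.DataFrame({
    "DepartmentID": [1, 2, 3, 4],
    "DepartmentName": ["Finance", "Sales", "HR", "Marketing"],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
outer = pd.merge(employees, departments, on="DepartmentID",
                 how="outer", indicator=True)

# Rows that exist in only one of the two tables
unmatched = outer[outer["_merge"] != "both"]
print(unmatched)

# Nullable integer dtype: keeps whole numbers and shows <NA> for the gap
outer["EmployeeID"] = outer["EmployeeID"].astype("Int64")
print(outer)
```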


### Left Join <a class = 'anchor' id = left ></a>
- A left join returns all the records from the left (first) table and the matching records from the right (second) table. If there are no matches, it fills the missing values with `NaN`.
- In Pandas, we can perform a left join using the `merge()` function with the `how='left'` parameter.

In [11]:
# Perform the Left Join
left_join = pd.merge(employees, departments, on='DepartmentID', how='left')
left_join

Unnamed: 0,EmployeeID,EmployeeName,DepartmentID,DepartmentName
0,1,John,1,Finance
1,2,Sarah,2,Sales
2,3,Rahul,2,Sales
3,4,Priya,3,HR
4,5,Amit,1,Finance
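
In the output above every employee has a matching department, so no `NaN` appears. The sketch below adds a hypothetical sixth employee (`Neha`, in a non-existent department 5, invented purely for illustration) to show how a left join fills the gap:

```python
import pandas as pd

employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "EmployeeName": ["John", "Sarah", "Rahul", "Priya", "Amit"],
    "DepartmentID": [1, 2, 2, 3, 1],
})
departments = pd.DataFrame({
    "DepartmentID": [1, 2, 3, 4],
    "DepartmentName": ["Finance", "Sales", "HR", "Marketing"],
})

# Hypothetical extra row: department 5 does not exist in 'departments'
extra = pd.DataFrame({"EmployeeID": [6], "EmployeeName": ["Neha"], "DepartmentID": [5]})
employees_extra = pd.concat([employees, extra], ignore_index=True)

left = pd.merge(employees_extra, departments, on="DepartmentID", how="left")
print(left)
```

Every row of the left table survives, the unmatched employee gets `NaN` for `DepartmentName`, and Marketing is absent because it appears only in the right table.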


### Right Join <a class = 'anchor' id = right ></a>

- A right join returns all the records from the right (second) table and the matching records from the left (first) table. If there are no matches, it fills the missing values with `NaN`.
- In Pandas, we can perform a right join using the `merge()` function with the `how='right'` parameter.

In [12]:
# Perform the Right Join
right_join = pd.merge(employees, departments, on='DepartmentID', how='right')
right_join

Unnamed: 0,EmployeeID,EmployeeName,DepartmentID,DepartmentName
0,1.0,John,1,Finance
1,5.0,Amit,1,Finance
2,2.0,Sarah,2,Sales
3,3.0,Rahul,2,Sales
4,4.0,Priya,3,HR
5,,,4,Marketing


These examples demonstrate how the different join operations work using sample datasets in the Indian context. The same principles and techniques can be applied to real-world scenarios for data analysis and decision-making.

# 3.Handling Dates, Timezones, Unix Timestamps <a class = 'anchor' id = time></a> 

### Handling Dates and Times <a class = 'anchor' id = htd></a>
Pandas has a number of methods for working with dates and times. We can create a new DataFrame column of datetime objects from an existing column containing dates using the `to_datetime()` method:

In [13]:
# create a sample DataFrame
df = pd.DataFrame({
    'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04'],
    'value': [1, 2, 3, 4]})
df

Unnamed: 0,date,value
0,2023-04-01,1
1,2023-04-02,2
2,2023-04-03,3
3,2023-04-04,4


In [14]:
df.dtypes

date     object
value     int64
dtype: object

- Here the `date` column has dtype `object` (plain strings). Converting it to a datetime dtype lets us perform **time series** tasks such as date arithmetic and filtering by date.
- The Pandas function `to_datetime()` casts the data to datetime objects.

In [15]:
# convert the 'date' column to a datetime object
df['date'] = pd.to_datetime(df['date'])

# print the resulting DataFrame
df

Unnamed: 0,date,value
0,2023-04-01,1
1,2023-04-02,2
2,2023-04-03,3
3,2023-04-04,4


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    4 non-null      datetime64[ns]
 1   value   4 non-null      int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 192.0 bytes


- This replaces the strings in the `date` column with datetime objects (dtype `datetime64[ns]`) that can be used for further calculations or manipulations.
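
Once the column holds datetime values, the `dt` accessor and ordinary comparison operators become available. Here is a brief sketch of what this enables; the column names `weekday` and `days_since_start` are our own choices:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2023-04-01", "2023-04-02", "2023-04-03", "2023-04-04"],
    "value": [1, 2, 3, 4],
})
df["date"] = pd.to_datetime(df["date"])

# Extract calendar information through the .dt accessor
df["weekday"] = df["date"].dt.day_name()

# Date arithmetic: whole days elapsed since the earliest date
df["days_since_start"] = (df["date"] - df["date"].min()).dt.days

# Filter rows by date; the string is parsed for the comparison
later = df[df["date"] >= "2023-04-03"]

print(df)
print(later)
```

None of these operations work on the original string column, which is why the conversion step comes first.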

### Handling Timezones <a class = 'anchor' id = Tzone ></a>
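Pandas allows us to work with timezones, for example through the `tz` parameter of `DatetimeIndex`. To assign a timezone to a timezone-naive datetime column, we can use the `tz_localize()` method; converting between timezones is done with `tz_convert()`, shown later in this section.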


In [17]:
import pytz  # time zone python module

# create a sample DataFrame
df = pd.DataFrame({
    'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04'],
    'value': [1, 2, 3, 4]})
df

Unnamed: 0,date,value
0,2023-04-01,1
1,2023-04-02,2
2,2023-04-03,3
3,2023-04-04,4


In [18]:
# convert the 'date' column to a datetime object (no timezone is set yet)
df['date'] = pd.to_datetime(df['date'])
# print the resulting DataFrame
df

Unnamed: 0,date,value
0,2023-04-01,1
1,2023-04-02,2
2,2023-04-03,3
3,2023-04-04,4


In [19]:
# now let's check time zone
df['date']

0   2023-04-01
1   2023-04-02
2   2023-04-03
3   2023-04-04
Name: date, dtype: datetime64[ns]

In [20]:
df['date'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 4 entries, 0 to 3
Series name: date
Non-Null Count  Dtype         
--------------  -----         
4 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 160.0 bytes


- We can see that no timezone is specified. Let's set the timezone to `US/Eastern`.
- We use the `dt` accessor to reach the **`tz_localize()`** method, which sets the timezone to `'US/Eastern'`. This replaces the column with `timezone-aware` datetime objects.

In [21]:
df['date'] = pd.to_datetime(df['date']).dt.tz_localize('US/Eastern')
# print the resulting DataFrame
df

Unnamed: 0,date,value
0,2023-04-01 00:00:00-04:00,1
1,2023-04-02 00:00:00-04:00,2
2,2023-04-03 00:00:00-04:00,3
3,2023-04-04 00:00:00-04:00,4


In [22]:
df['date'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 4 entries, 0 to 3
Series name: date
Non-Null Count  Dtype                     
--------------  -----                     
4 non-null      datetime64[ns, US/Eastern]
dtypes: datetime64[ns, US/Eastern](1)
memory usage: 160.0 bytes


- Here, we can see that the timezone has changed from `None` to `US/Eastern`.
- But this raises another question:
#### How do we know which time zones are available?
`pytz.all_timezones` lists every time zone name that `pytz` knows.

In [23]:
# print(pytz.all_timezones)
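
As an alternative sketch that does not depend on `pytz`, Python 3.9+ ships the `zoneinfo` module, whose `available_timezones()` function returns the names it can find. This assumes time zone data is installed on the system (or through the `tzdata` package); the variable names are illustrative.

```python
from zoneinfo import available_timezones

# available_timezones() returns a set, so sort it for stable output
all_zones = sorted(available_timezones())

# Narrow the long list down to one region
asia_zones = [zone for zone in all_zones if zone.startswith("Asia/")]

print(len(all_zones), "time zones found")
print(asia_zones[:5])
print("Asia/Kolkata" in asia_zones)
```

Filtering by prefix is usually more practical than printing the full list, which runs to several hundred names.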

#### How do we convert time zones?
The `tz_convert()` method converts timezone-aware datetimes from one time zone to another.

Let's convert the dates to the **Indian** and **Israeli** time zones and store the results in new columns.

In [24]:
df['date_Indian_time_zone'] = pd.to_datetime(df['date']).dt.tz_convert('Asia/Kolkata')
# print the resulting DataFrame
df


Unnamed: 0,date,value,date_Indian_time_zone
0,2023-04-01 00:00:00-04:00,1,2023-04-01 09:30:00+05:30
1,2023-04-02 00:00:00-04:00,2,2023-04-02 09:30:00+05:30
2,2023-04-03 00:00:00-04:00,3,2023-04-03 09:30:00+05:30
3,2023-04-04 00:00:00-04:00,4,2023-04-04 09:30:00+05:30


In [25]:
df['Israel_time_zone'] = pd.to_datetime(df['date']).dt.tz_convert('Israel')
# print the resulting DataFrame
df

Unnamed: 0,date,value,date_Indian_time_zone,Israel_time_zone
0,2023-04-01 00:00:00-04:00,1,2023-04-01 09:30:00+05:30,2023-04-01 07:00:00+03:00
1,2023-04-02 00:00:00-04:00,2,2023-04-02 09:30:00+05:30,2023-04-02 07:00:00+03:00
2,2023-04-03 00:00:00-04:00,3,2023-04-03 09:30:00+05:30,2023-04-03 07:00:00+03:00
3,2023-04-04 00:00:00-04:00,4,2023-04-04 09:30:00+05:30,2023-04-04 07:00:00+03:00
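
The difference between the two methods is worth spelling out: `tz_localize()` attaches a time zone to a naive timestamp without shifting the clock reading, while `tz_convert()` re-expresses an already aware timestamp in another zone without changing the instant. Here is a small sketch on a single `pd.Timestamp`:

```python
import pandas as pd

naive = pd.Timestamp("2023-04-01 00:00:00")   # no time zone attached

# Localize: same wall-clock time, now interpreted as US/Eastern
eastern = naive.tz_localize("US/Eastern")

# Convert: same instant, expressed as a wall-clock time in India
kolkata = eastern.tz_convert("Asia/Kolkata")

print(eastern)
print(kolkata)
print(eastern == kolkata)   # aware timestamps compare by instant

# tz_convert() needs an aware timestamp; on a naive one it raises TypeError
convert_failed = False
try:
    naive.tz_convert("Asia/Kolkata")
except TypeError as exc:
    convert_failed = True
    print("Cannot convert a naive timestamp:", exc)
```

This is why the cells above call `tz_localize()` first and `tz_convert()` only afterwards.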


### Handling Unix Timestamps <a class ='anchor' id = unix ></a>
- Unix timestamps represent the number of seconds that have elapsed since **January 1, 1970** at **00:00:00 UTC**.
- They are a common way to represent time in computer systems.
- Pandas provides functions to convert Unix timestamps to datetime objects, allowing for easier manipulation and analysis.
- The **`to_datetime()`** function can convert a Unix timestamp column to a datetime column when called with `unit='s'`.
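
A minimal sketch of that conversion, using a hypothetical `unix_seconds` column. The first value is the epoch itself and the second is the Unix time for 2023-06-14 12:30:45 UTC:

```python
import pandas as pd

# Hypothetical column of Unix timestamps, in seconds
ts_df = pd.DataFrame({"unix_seconds": [0, 1686745845]})

# unit='s' tells pandas the numbers are seconds since 1970-01-01 00:00:00 UTC
ts_df["datetime"] = pd.to_datetime(ts_df["unix_seconds"], unit="s")

# utc=True returns timezone-aware values instead of naive ones
ts_df["datetime_utc"] = pd.to_datetime(ts_df["unix_seconds"], unit="s", utc=True)

print(ts_df)
```

Without `unit='s'`, integers are interpreted as nanoseconds, so the unit should always be stated explicitly.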

In [26]:
# Create a sample DataFrame with datetime data
data = {
    'Date': ['2023-06-14 12:30:45', '2023-06-14 14:45:30', '2023-06-14 18:20:15'],
    'Value': [10, 20, 30]
}
df = pd.DataFrame(data)

# Convert 'Date' column to datetime
# df['Date'] = pd.to_datetime(df['Date'])
df

Unnamed: 0,Date,Value
0,2023-06-14 12:30:45,10
1,2023-06-14 14:45:30,20
2,2023-06-14 18:20:15,30


In [27]:
# Create a  DataFrame with datetime data
data = {
    'Date': ['2023-06-14 12:30:45', '2023-06-14 14:45:30', '2023-06-14 18:20:15'],
    'Value': [10, 20, 30]
}
df = pd.DataFrame(data)



In [28]:
# Convert 'Date' column to datetime
df['Date'] = pd.to_datetime(df['Date'])
df

Unnamed: 0,Date,Value
0,2023-06-14 12:30:45,10
1,2023-06-14 14:45:30,20
2,2023-06-14 18:20:15,30


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    3 non-null      datetime64[ns]
 1   Value   3 non-null      int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 176.0 bytes


In [30]:
# Convert datetime to Unix timestamp
df['Timestamp'] = df['Date'].apply(lambda x: x.timestamp())
df

Unnamed: 0,Date,Value,Timestamp
0,2023-06-14 12:30:45,10,1686746000.0
1,2023-06-14 14:45:30,20,1686754000.0
2,2023-06-14 18:20:15,30,1686767000.0


In [31]:
# Convert Unix timestamp back to datetime
df['Date_from_timestamp'] = df['Timestamp'].apply(lambda x: pd.to_datetime(x, unit='s'))
df

Unnamed: 0,Date,Value,Timestamp,Date_from_timestamp
0,2023-06-14 12:30:45,10,1686746000.0,2023-06-14 12:30:45
1,2023-06-14 14:45:30,20,1686754000.0,2023-06-14 14:45:30
2,2023-06-14 18:20:15,30,1686767000.0,2023-06-14 18:20:15
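
The `apply()` calls above work row by row. As a sketch of a vectorized alternative, subtracting the epoch and dividing by one second gives integer timestamps for the whole column at once, and `pd.to_datetime(..., unit='s')` accepts the whole column on the way back:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-06-14 12:30:45",
                            "2023-06-14 14:45:30",
                            "2023-06-14 18:20:15"]),
    "Value": [10, 20, 30],
})

# Datetime -> Unix seconds: elapsed time since the epoch, in whole seconds
epoch = pd.Timestamp("1970-01-01")
df["Timestamp"] = (df["Date"] - epoch) // pd.Timedelta(seconds=1)

# Unix seconds -> datetime, converting the entire column in one call
df["Date_from_timestamp"] = pd.to_datetime(df["Timestamp"], unit="s")

print(df)
```

The exact values are 1686745845, 1686753930 and 1686766815; the `Timestamp` column in the table above appears to show them rounded for display.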


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 3 non-null      datetime64[ns]
 1   Value                3 non-null      int64         
 2   Timestamp            3 non-null      float64       
 3   Date_from_timestamp  3 non-null      datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 224.0 bytes


- **Timestamps** have many applications in data science and machine learning, including time series forecasting, fraud detection, user behavior analysis, time-based recommendations, time-dependent predictive modeling, A/B testing, resource allocation, real-time analytics, and event streaming.
- In these settings, timestamps make it possible to order events, measure the time elapsed between them, and aggregate data over time windows, which supports forecasting, anomaly detection, personalized recommendations, and real-time processing.

# 4.Introduction to Rolling Operations <a class = 'anchor' id = roll_intro></a>
The rolling operations are useful for calculating rolling statistics or aggregations, such as **rolling mean, rolling sum, rolling standard deviation, etc**. These operations allow you to analyze data **trends** over time or **to smooth out noisy data** by calculating aggregated values over a specified window.


- Rolling operations in pandas refer to performing calculations over a sliding window of data points in a time series or a moving window of data points in a numerical series. The `rolling()` function in pandas is used to perform rolling operations on a pandas dataframe or a pandas series.
- The `rolling()` function generates a rolling object which can be used to apply various mathematical and statistical functions over a rolling window of data points. The size of the rolling window can be specified using the window parameter of the `rolling()` function.
- We can then apply various aggregation functions, such as **`mean()`, `sum()`, `std()`, `min()`, `max()`**, etc., to the Rolling object to compute the desired rolling statistic.
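
Besides `window`, two other parameters of `rolling()` often matter in practice: `min_periods` (how many values a window needs before a result is produced) and `center` (whether the window is centered on the current row instead of ending at it). A short sketch:

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50, 60])

# Default: a result appears only once the window holds 3 values
default_sum = data.rolling(window=3).sum()

# min_periods=1: partial windows at the start are summed as well
partial_sum = data.rolling(window=3, min_periods=1).sum()

# center=True: each window covers the previous, current and next value
centered_mean = data.rolling(window=3, center=True).mean()

print(pd.DataFrame({
    "default_sum": default_sum,
    "partial_sum": partial_sum,
    "centered_mean": centered_mean,
}))
```

With `center=True` the missing values move to both ends of the series, because the first and last rows lack a neighbor on one side.
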
#### Rolling Sum <a class = 'anchor' id = 'rsum'></a>
The rolling sum is the sum of the values in a rolling window. The `rolling()` function in pandas can be used to calculate the rolling sum of a pandas series or a pandas dataframe.

In [33]:
# create a pandas series
data = pd.Series([10, 20, 30, 40, 50, 60])

# calculate the rolling sum with window size 3
rolling_sum = data.rolling(window=3).sum()

print(rolling_sum)

0      NaN
1      NaN
2     60.0
3     90.0
4    120.0
5    150.0
dtype: float64


- We first create a pandas series data with values `10, 20, 30, 40, 50`, and `60`. 
- We then use the rolling() function to calculate the rolling sum of the series with a window size of 3. 
- The resulting rolling sum is `[NaN, NaN, 60.0, 90.0, 120.0, 150.0]`. The first two values are `NaN` because fewer than 3 values are available at those positions.

#### Rolling Mean <a class = 'anchor' id = 'rmean'></a>
The rolling mean is the mean of the values in a rolling window. The `rolling()` function in pandas can be used to calculate the rolling mean of a pandas series or a pandas dataframe.

In [34]:
# create a pandas series
data = pd.Series([10, 20, 30, 40, 50, 60])

# calculate the rolling mean with window size 3
rolling_mean = data.rolling(window=3).mean()

print(rolling_mean)

0     NaN
1     NaN
2    20.0
3    30.0
4    40.0
5    50.0
dtype: float64


- We first create a pandas series data with values `10, 20, 30, 40, 50`, and `60`. 
- We then use the rolling() function to calculate the rolling mean of the series with a window size of 3. 
- The resulting rolling mean is `[NaN, NaN, 20.0, 30.0, 40.0, 50.0]`.
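
To illustrate the smoothing effect mentioned in the introduction, here is a sketch with a deliberately noisy, made-up series that alternates between 10 and 30:

```python
import pandas as pd

# Made-up series that jumps up and down around 20
noisy = pd.Series([10, 30, 10, 30, 10, 30])

# Averaging each value with its predecessor cancels the alternation
smoothed = noisy.rolling(window=2).mean()

print(pd.DataFrame({"noisy": noisy, "smoothed": smoothed}))
print("spread before:", noisy.std())
print("spread after: ", smoothed.std())
```

Every complete window contains one 10 and one 30, so the rolling mean is a flat 20. Real data will not smooth this perfectly, but the principle is the same: larger windows remove more short-term variation.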

#### Rolling Rank <a class = 'anchor' id = 'rrank'></a>
The rolling rank is the rank of the values in a rolling window. The `rolling()` function in pandas can be used to calculate the rolling rank of a pandas series or a pandas dataframe.

In [35]:
# create a pandas series
data = pd.Series([10, 20, 30, 40, 50, 60])

# calculate the rolling rank with window size 3
rolling_rank = data.rolling(window=3).apply(lambda x: pd.Series(x).rank().values[-1])

print(rolling_rank)

0    NaN
1    NaN
2    3.0
3    3.0
4    3.0
5    3.0
dtype: float64


- We first create a pandas series data with values `10, 20, 30, 40, 50, and 60`. We then use the `rolling()` function to calculate the rolling rank of the series with a window size of 3. 
- The `apply()` method is used to apply a **lambda function** that calculates the rank of the values in the rolling window and returns the last rank value. 
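- The resulting rolling rank is `[NaN, NaN, 3.0, 3.0, 3.0, 3.0]`: because the series is strictly increasing, the last value in every window is the largest of the three and therefore has rank 3.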
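
With strictly increasing data the rank is always 3, which hides what the operation does. The sketch below uses a made-up, unordered series instead. It also calls the built-in `Rolling.rank()` method, which assumes pandas 1.4 or later:

```python
import pandas as pd

# Made-up series that is not sorted
data = pd.Series([30, 10, 20, 50, 40])

# Same approach as above: rank the window, keep the rank of its last value
rank_via_apply = data.rolling(window=3).apply(
    lambda window: pd.Series(window).rank().iloc[-1]
)

# Built-in equivalent (assumption: pandas >= 1.4)
rank_builtin = data.rolling(window=3).rank()

print(pd.DataFrame({
    "value": data,
    "rank_via_apply": rank_via_apply,
    "rank_builtin": rank_builtin,
}))
```

For example, the first full window is `[30, 10, 20]`; its last value, 20, is the second smallest, so the rank is 2.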

- Rolling operations in pandas are useful for analyzing time series and numerical data. The `rolling()` function in pandas can be used to calculate the `rolling sum`, `rolling rank`, and other **`rolling statistics`**. 
- By specifying the window size, we can control the size of the rolling window and the number of data points included in the rolling calculation.

In the same way, we can perform other rolling operations.

In [36]:
# Calculate rolling sum with a window size of 3
rolling_sum = data.rolling(window=3).sum()
rolling_sum

0      NaN
1      NaN
2     60.0
3     90.0
4    120.0
5    150.0
dtype: float64

In [37]:
# Calculate rolling standard deviation with a window size of 4
rolling_std = data.rolling(window=4).std()
print(rolling_std)

0          NaN
1          NaN
2          NaN
3    12.909944
4    12.909944
5    12.909944
dtype: float64
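
The value `12.909944` can be checked by hand. As a sketch, using the fact that pandas computes the sample standard deviation by default (`ddof=1`), the first full window `[10, 20, 30, 40]` gives:

```python
import math

import pandas as pd

window = [10, 20, 30, 40]

# Sample standard deviation: divide the squared deviations by n - 1
mean = sum(window) / len(window)                        # 25.0
squared_devs = sum((x - mean) ** 2 for x in window)     # 500.0
sample_std = math.sqrt(squared_devs / (len(window) - 1))

# pandas default (ddof=1) and the population version (ddof=0)
pandas_std = pd.Series(window).rolling(window=4).std().iloc[-1]
population_std = pd.Series(window).rolling(window=4).std(ddof=0).iloc[-1]

print(sample_std, pandas_std, population_std)
```

All three windows in the output above have the same spread because each one is four consecutive values spaced 10 apart.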


In [38]:
# Calculate rolling minimum with a window size of 2
rolling_min = data.rolling(window=2).min()
print(rolling_min)

0     NaN
1    10.0
2    20.0
3    30.0
4    40.0
5    50.0
dtype: float64


In [39]:
# Calculate rolling maximum with a window size of 3
rolling_max = data.rolling(window=3).max()
print(rolling_max)

0     NaN
1     NaN
2    30.0
3    40.0
4    50.0
5    60.0
dtype: float64
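
Several statistics can also be computed in one pass by handing a list of function names to `agg()` on the rolling object. A minimal sketch on the same series:

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50, 60])

# One column per statistic, all using the same window of 3
summary = data.rolling(window=3).agg(["sum", "mean", "min", "max"])
print(summary)
```

The result is a DataFrame whose columns are named after the functions, which makes it easy to compare the statistics side by side.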
