# Table of Contents
1. [Shifting and Lagging](#sl) 
2. [Pivot and Pivot Table](#ppt)
3. [Working With Text Data](#text)
    - [Lowercasing text](#Ltext)
    - [Uppercasing text](#Utext)
    - [Removing punctuation](#remove)
    - [Removing numbers](#removeN)
    - [Removing whitespace](#removeW)

## Shifting and Lagging <a class = 'anchor' id = sl></a>  
- **Shifting** moves the values in a column or a time series (time-dependent data points) by a specified number of periods.
- It lets us create new columns from the shifted values of existing columns, or perform calculations on the lagged values of a time series.

- The **`shift()`** method in pandas **shifts** the values in a Series or DataFrame. It takes an optional parameter **`periods`** (default 1) that specifies the number of periods to shift by.
- If `periods` is **positive**, the values are **shifted forward** in time (down the column), while a **negative** value of `periods` **shifts the values backward** in time (up the column).

In [1]:
import pandas as pd

data = {'A': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Shift values in column 'A' by 1 period forward
df['Shifted'] = df['A'].shift(1)

df

Unnamed: 0,A,Shifted
0,1,
1,2,1.0
2,3,2.0
3,4,3.0
4,5,4.0


The values in column **`A`** are shifted by **1 period forward**, resulting in a new column **`'Shifted'`** where each value is the previous value of **`A`**.
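
Note that the `Shifted` column is displayed as floats: the position vacated by the shift is filled with `NaN`, which forces a float dtype. As a small sketch (the series `s` below is only an illustration, not part of the original notebook), the `fill_value` parameter of `shift()` can supply a different filler and keep the integer dtype:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], name='A')

# Default behaviour: the vacated position becomes NaN, so the dtype turns into float64
shifted_default = s.shift(1)

# With fill_value, the vacated position receives the given value instead of NaN
shifted_filled = s.shift(1, fill_value=0)

print(shifted_default.dtype)
print(shifted_filled.tolist())
```

Whether `0` is a sensible filler depends on the data; for many time series it is safer to keep the `NaN` and handle it explicitly.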

In [2]:
# Shift values in column 'A' by 2 periods forward
df['Shifted'] = df['A'].shift(2)
df

Unnamed: 0,A,Shifted
0,1,
1,2,
2,3,1.0
3,4,2.0
4,5,3.0


In [3]:
# Shift values in column 'A' by 1 period backward (periods=-1)
df['Shifted'] = df['A'].shift(-1)
df

Unnamed: 0,A,Shifted
0,1,2.0
1,2,3.0
2,3,4.0
3,4,5.0
4,5,


In [4]:
# Shift values in column 'A' by 2 periods backward (periods=-2)
df['Shifted'] = df['A'].shift(-2)
df

Unnamed: 0,A,Shifted
0,1,3.0
1,2,4.0
2,3,5.0
3,4,
4,5,


In data science and machine learning, shifting and lagging are used for **time series forecasting, feature engineering, and temporal analysis**: lagged values serve as predictive features and help identify patterns and trends in data.
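
As a minimal sketch of the feature-engineering use case, the snippet below builds lag features from a small time series. The `sales` figures and the column names `lag_1`, `lag_2`, and `change` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical daily sales figures, used only for illustration
sales = pd.DataFrame(
    {'sales': [100, 120, 90, 150, 130]},
    index=pd.date_range('2021-01-01', periods=5, freq='D'),
)

# Lag features: the value observed 1 and 2 days earlier
sales['lag_1'] = sales['sales'].shift(1)
sales['lag_2'] = sales['sales'].shift(2)

# Day-over-day change, computed from the lagged value
sales['change'] = sales['sales'] - sales['lag_1']

# Rows without a complete history contain NaN and are often dropped before modelling
features = sales.dropna()
print(features)
```

Each row of `features` now holds the current value together with its recent history, which is the shape most forecasting models expect.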

## Pivot and Pivot Table <a class = 'anchor' id = ppt></a>
### **`Pivot`**
- The `pivot()` function in pandas reshapes data by converting values from a column into separate columns, effectively rotating the data. 
- It requires specifying a column to use as the index, a column to use as the new columns, and a column to use as the values. This transformation is useful when we want to convert a long-format dataset into a wide-format representation.

### **`Pivot Table`** 
- The `pivot_table()` function in pandas creates a **multi-dimensional summary of data**, similar to an Excel pivot table.
- It allows you to aggregate and summarize data based on one or more columns, producing a new DataFrame with aggregated values. 
- We can specify the columns to use as the index, columns to use as columns in the pivot table, and columns to use for calculations (e.g., sum, mean, count). Pivot tables are commonly used for data exploration, analysis, and reporting.

In [5]:
# Load a sample dataset from a CSV file
df = pd.read_csv('dataset/pivot.csv')
df

Unnamed: 0,Date,Category,Value
0,2021-01-01,A,10
1,2021-01-01,B,20
2,2021-01-02,A,30
3,2021-01-02,B,40
4,2021-01-03,A,50
5,2021-01-03,B,60
6,2021-01-04,A,70
7,2021-01-04,B,80
8,2021-01-05,A,90
9,2021-01-05,B,100


In [6]:
# Pivot 
pivot_df = df.pivot(index='Date', columns='Category', values='Value')
pivot_df

Date,A,B
2021-01-01,10,20
2021-01-02,30,40
2021-01-03,50,60
2021-01-04,70,80
2021-01-05,90,100


- The original DataFrame has three columns: **`'Date', 'Category', and 'Value'`**. 
- By using `pivot()`, we transformed the data based on the **`'Date'`** column as the index, `Category` column as the new columns, and `Value` column as the values, resulting in a new DataFrame with the values rearranged.
- Pivot tables provide additional functionality and flexibility compared to simple pivoting. They allow you to perform **aggregations**, specify multiple index and column levels, apply custom aggregation functions, handle missing data, and more.
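
One practical difference is how the two functions treat duplicates. The sketch below uses a hypothetical DataFrame (`dup_df`, not the CSV loaded above) in which the same `Date`/`Category` pair appears twice: `pivot()` raises a `ValueError`, while `pivot_table()` aggregates the duplicated rows.

```python
import pandas as pd

# Hypothetical long-format data with a duplicated (Date, Category) pair
dup_df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02'],
    'Category': ['A', 'A', 'B', 'A'],
    'Value': [10, 15, 20, 30],
})

# pivot() cannot decide which of the two values for (2021-01-01, A) to keep
try:
    dup_df.pivot(index='Date', columns='Category', values='Value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot() failed:', err)

# pivot_table() aggregates the duplicates instead (here by summing them)
summary = dup_df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
print(summary)
```

Combinations that never occur in the data (category `B` on `2021-01-02` here) appear as `NaN` in the result.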

In [7]:
# Pivot table example
pivot_table_df = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')

pivot_table_df


Date,A,B
2021-01-01,10,20
2021-01-02,30,40
2021-01-03,50,60
2021-01-04,70,80
2021-01-05,90,100


Here **`aggfunc`** specifies the aggregation function(s) to apply to the values. It can be the name of a built-in aggregation such as **`'sum'`**, **`'mean'`**, or **`'count'`**, or a custom aggregation function.

- In this pivot table example, we used `pivot_table()` to create a summary of the data, aggregating the values using the sum function. The resulting pivot table provides a compact representation of the data, showing the total values for each category on each date.

- Both `pivot()` and `pivot_table()` functions are powerful tools for reshaping and summarizing data in pandas, providing flexibility in data analysis and exploration.
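
To illustrate the `aggfunc` options described above, here is a hedged sketch with a hypothetical DataFrame (`long_df`). It uses a built-in aggregation by name, a custom function, and the `fill_value` parameter of `pivot_table()` to replace missing combinations.

```python
import pandas as pd

# Hypothetical data: category B has no entry on 2021-01-02
long_df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02'],
    'Category': ['A', 'A', 'B', 'A'],
    'Value': [10, 30, 20, 50],
})

# Mean per Date/Category; missing combinations are filled with 0 instead of NaN
mean_table = long_df.pivot_table(
    index='Date', columns='Category', values='Value',
    aggfunc='mean', fill_value=0,
)

# A custom aggregation: the range (max - min) of the values in each group
range_table = long_df.pivot_table(
    index='Date', columns='Category', values='Value',
    aggfunc=lambda x: x.max() - x.min(), fill_value=0,
)

print(mean_table)
print(range_table)
```

Filling with `0` is an assumption that suits counts and sums; for means it can be misleading, so keeping `NaN` is often the safer default.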

## Working With Text Data <a class = 'anchor' id = text></a>

- Text data in particular can be extremely messy and difficult to work with because it can contain all sorts of characters and symbols that may have little meaning for your analysis. This lesson will cover some basic techniques and functions for working with text data in Python.
- Cleaning text data is an important step in any data analysis or machine learning project.
- Text data can contain various inconsistencies, such as capitalization, spelling mistakes, punctuation, special characters, and more. 
- We will explore how to clean text data inside a pandas DataFrame using Python.


`.str` is an accessor in pandas that provides a set of string methods for operating on string data within a column. When a column of a pandas DataFrame contains string data, we can use `.str` to apply string operations to every value in that column. Commonly used methods include:

- **`len()`**: Calculates the length of each string.
- **`lower()`**: Converts each string to lowercase.
- **`upper()`**: Converts each string to uppercase.
- **`contains()`**: Checks if each string contains a specific substring.
- **`replace()`**: Replaces specific substrings within each string.
- **`split()`**: Splits each string into a list based on a specified delimiter.
- **`strip()`**: Removes leading and trailing whitespace from each string.
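
A few of these methods in action, as a minimal sketch. The series `s` holds hypothetical values chosen for illustration; `regex=False` is passed to `replace()` so the pattern is treated as a literal substring regardless of the pandas version.

```python
import pandas as pd

# Hypothetical labels, used only for illustration
s = pd.Series(['  Apple pie ', 'banana split', 'Cherry tart'])

has_an = s.str.contains('an')                        # True where the substring 'an' occurs
replaced = s.str.replace('pie', 'cake', regex=False) # literal substring replacement
tokens = s.str.strip().str.split(' ')                # trim, then split on single spaces
lengths = s.str.len()                                # characters per string, spaces included

print(has_an.tolist())
print(replaced.tolist())
print(tokens.tolist())
print(lengths.tolist())
```

Each call returns a new Series of the same length, so the results can be assigned directly to new DataFrame columns.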

In [8]:
import pandas as pd
data = {'text': ['This is a sample sentence.',
                 '    This is another sentence with white spaces !  ',
                 'This is a third sentence with numbers 1234 and another numbers 343324.',
                 'This is the 4th sentence, with punctuation?',
                 'This is a 5th sentence with special characters #%^!',
                '   This is 6th sentence with extra white spaces      ']}
df = pd.DataFrame(data)
df

Unnamed: 0,text
0,This is a sample sentence.
1,This is another sentence with white spaces...
2,This is a third sentence with numbers 1234 and...
3,"This is the 4th sentence, with punctuation?"
4,This is a 5th sentence with special characters...
5,This is 6th sentence with extra white space...


### Lowercasing text <a class = 'anchor' id = 'Ltext'></a>
- String methods in pandas **mirror Python's built-in string methods**, and many have the **same name** as their **single-string counterparts**.
- For example, **`str.lower()`** converts a single string to lowercase, while **`series.str.lower()`** converts all the strings in a series to lowercase.
- Lowercasing text is a common step in text cleaning, as it helps in standardizing the text data.
- We can use the **`str.lower()`** method of pandas to convert all the text data in a DataFrame to lowercase.

In [9]:
df['text1'] = df['text'].str.lower()
df

Unnamed: 0,text,text1
0,This is a sample sentence.,this is a sample sentence.
1,This is another sentence with white spaces...,this is another sentence with white spaces...
2,This is a third sentence with numbers 1234 and...,this is a third sentence with numbers 1234 and...
3,"This is the 4th sentence, with punctuation?","this is the 4th sentence, with punctuation?"
4,This is a 5th sentence with special characters...,this is a 5th sentence with special characters...
5,This is 6th sentence with extra white space...,this is 6th sentence with extra white space...


### Uppercasing text <a class = 'anchor' id = 'Utext'></a>
- Uppercasing text is another way to standardize the case of text data.
- We can use the **`str.upper()`** method of pandas to convert all the text data in a column to uppercase.

In [10]:
df['text1'] = df['text'].str.upper()
df

Unnamed: 0,text,text1
0,This is a sample sentence.,THIS IS A SAMPLE SENTENCE.
1,This is another sentence with white spaces...,THIS IS ANOTHER SENTENCE WITH WHITE SPACES...
2,This is a third sentence with numbers 1234 and...,THIS IS A THIRD SENTENCE WITH NUMBERS 1234 AND...
3,"This is the 4th sentence, with punctuation?","THIS IS THE 4TH SENTENCE, WITH PUNCTUATION?"
4,This is a 5th sentence with special characters...,THIS IS A 5TH SENTENCE WITH SPECIAL CHARACTERS...
5,This is 6th sentence with extra white space...,THIS IS 6TH SENTENCE WITH EXTRA WHITE SPACE...


###  Removing punctuation <a class = 'anchor' id = 'remove'></a>
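- Punctuation marks such as periods, commas, and semicolons often add little value to text data analysis.
- We can remove punctuation using the `str.replace()` method of pandas together with a regular expression. The pattern `[^\w\s]` matches any character that is neither a word character (letter, digit, or underscore) nor whitespace.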

In [11]:
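# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
df['text2'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)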
df


Unnamed: 0,text,text1,text2
0,This is a sample sentence.,THIS IS A SAMPLE SENTENCE.,This is a sample sentence
1,This is another sentence with white spaces...,THIS IS ANOTHER SENTENCE WITH WHITE SPACES...,This is another sentence with white spaces
2,This is a third sentence with numbers 1234 and...,THIS IS A THIRD SENTENCE WITH NUMBERS 1234 AND...,This is a third sentence with numbers 1234 and...
3,"This is the 4th sentence, with punctuation?","THIS IS THE 4TH SENTENCE, WITH PUNCTUATION?",This is the 4th sentence with punctuation
4,This is a 5th sentence with special characters...,THIS IS A 5TH SENTENCE WITH SPECIAL CHARACTERS...,This is a 5th sentence with special characters
5,This is 6th sentence with extra white space...,THIS IS 6TH SENTENCE WITH EXTRA WHITE SPACE...,This is 6th sentence with extra white space...


### Removing numbers <a class = 'anchor' id = 'removeN'></a>
- In many cases, numbers are not relevant to text data analysis.
- We can remove numbers using the same `str.replace()` method.

In [12]:
# \d+ matches one or more digits; see Python's 're' module for details
df['text3'] = df['text'].str.replace(r'\d+', '', regex=True)
df


Unnamed: 0,text,text1,text2,text3
0,This is a sample sentence.,THIS IS A SAMPLE SENTENCE.,This is a sample sentence,This is a sample sentence.
1,This is another sentence with white spaces...,THIS IS ANOTHER SENTENCE WITH WHITE SPACES...,This is another sentence with white spaces,This is another sentence with white spaces...
2,This is a third sentence with numbers 1234 and...,THIS IS A THIRD SENTENCE WITH NUMBERS 1234 AND...,This is a third sentence with numbers 1234 and...,This is a third sentence with numbers and ano...
3,"This is the 4th sentence, with punctuation?","THIS IS THE 4TH SENTENCE, WITH PUNCTUATION?",This is the 4th sentence with punctuation,"This is the th sentence, with punctuation?"
4,This is a 5th sentence with special characters...,THIS IS A 5TH SENTENCE WITH SPECIAL CHARACTERS...,This is a 5th sentence with special characters,This is a th sentence with special characters ...
5,This is 6th sentence with extra white space...,THIS IS 6TH SENTENCE WITH EXTRA WHITE SPACE...,This is 6th sentence with extra white space...,This is th sentence with extra white spaces...


### Removing whitespace <a class = 'anchor' id = 'removeW'></a>
- Extra whitespace can be distracting and may affect text data analysis.
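- We can remove leading and trailing whitespace using the `str.strip()` method of pandas. Whitespace inside the string is left unchanged.
- We can also use `str.rstrip()` or `str.lstrip()` to strip only the right or the left side.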

In [13]:
df['text4'] = df['text'].str.strip()
df

Unnamed: 0,text,text1,text2,text3,text4
0,This is a sample sentence.,THIS IS A SAMPLE SENTENCE.,This is a sample sentence,This is a sample sentence.,This is a sample sentence.
1,This is another sentence with white spaces...,THIS IS ANOTHER SENTENCE WITH WHITE SPACES...,This is another sentence with white spaces,This is another sentence with white spaces...,This is another sentence with white spaces !
2,This is a third sentence with numbers 1234 and...,THIS IS A THIRD SENTENCE WITH NUMBERS 1234 AND...,This is a third sentence with numbers 1234 and...,This is a third sentence with numbers and ano...,This is a third sentence with numbers 1234 and...
3,"This is the 4th sentence, with punctuation?","THIS IS THE 4TH SENTENCE, WITH PUNCTUATION?",This is the 4th sentence with punctuation,"This is the th sentence, with punctuation?","This is the 4th sentence, with punctuation?"
4,This is a 5th sentence with special characters...,THIS IS A 5TH SENTENCE WITH SPECIAL CHARACTERS...,This is a 5th sentence with special characters,This is a th sentence with special characters ...,This is a 5th sentence with special characters...
5,This is 6th sentence with extra white space...,THIS IS 6TH SENTENCE WITH EXTRA WHITE SPACE...,This is 6th sentence with extra white space...,This is th sentence with extra white spaces...,This is 6th sentence with extra white spaces
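
The cleaning steps covered so far can also be chained into a single expression. The sketch below is one possible ordering, not a prescribed pipeline; the series `raw` reuses two sentences from the sample data, and the extra `\s+` replacement (an addition not shown above) collapses runs of whitespace inside the string, which `str.strip()` alone does not touch.

```python
import pandas as pd

raw = pd.Series([
    '    This is another sentence with white spaces !  ',
    'This is the 4th sentence, with punctuation?',
])

cleaned = (
    raw.str.lower()
       .str.replace(r'[^\w\s]', '', regex=True)   # drop punctuation
       .str.replace(r'\d+', '', regex=True)       # drop digits
       .str.replace(r'\s+', ' ', regex=True)      # collapse runs of whitespace
       .str.strip()                               # trim both ends
)

print(cleaned.tolist())
```

The order matters: removing punctuation or digits can leave new gaps behind, so the whitespace steps come last.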


### Number of characters in text


In [14]:
df['Number_of_characters'] = df['text'].str.len()
df

Unnamed: 0,text,text1,text2,text3,text4,Number_of_characters
0,This is a sample sentence.,THIS IS A SAMPLE SENTENCE.,This is a sample sentence,This is a sample sentence.,This is a sample sentence.,26
1,This is another sentence with white spaces...,THIS IS ANOTHER SENTENCE WITH WHITE SPACES...,This is another sentence with white spaces,This is another sentence with white spaces...,This is another sentence with white spaces !,50
2,This is a third sentence with numbers 1234 and...,THIS IS A THIRD SENTENCE WITH NUMBERS 1234 AND...,This is a third sentence with numbers 1234 and...,This is a third sentence with numbers and ano...,This is a third sentence with numbers 1234 and...,70
3,"This is the 4th sentence, with punctuation?","THIS IS THE 4TH SENTENCE, WITH PUNCTUATION?",This is the 4th sentence with punctuation,"This is the th sentence, with punctuation?","This is the 4th sentence, with punctuation?",43
4,This is a 5th sentence with special characters...,THIS IS A 5TH SENTENCE WITH SPECIAL CHARACTERS...,This is a 5th sentence with special characters,This is a th sentence with special characters ...,This is a 5th sentence with special characters...,51
5,This is 6th sentence with extra white space...,THIS IS 6TH SENTENCE WITH EXTRA WHITE SPACE...,This is 6th sentence with extra white space...,This is th sentence with extra white spaces...,This is 6th sentence with extra white spaces,53


### Number of words

In [15]:
# Alternative: df['text'].apply(lambda x: len(x.split()))
df['Number_of_words'] = df['text'].str.split().str.len()
df

Unnamed: 0,text,text1,text2,text3,text4,Number_of_characters,Number_of_words
0,This is a sample sentence.,THIS IS A SAMPLE SENTENCE.,This is a sample sentence,This is a sample sentence.,This is a sample sentence.,26,5
1,This is another sentence with white spaces...,THIS IS ANOTHER SENTENCE WITH WHITE SPACES...,This is another sentence with white spaces,This is another sentence with white spaces...,This is another sentence with white spaces !,50,8
2,This is a third sentence with numbers 1234 and...,THIS IS A THIRD SENTENCE WITH NUMBERS 1234 AND...,This is a third sentence with numbers 1234 and...,This is a third sentence with numbers and ano...,This is a third sentence with numbers 1234 and...,70,12
3,"This is the 4th sentence, with punctuation?","THIS IS THE 4TH SENTENCE, WITH PUNCTUATION?",This is the 4th sentence with punctuation,"This is the th sentence, with punctuation?","This is the 4th sentence, with punctuation?",43,7
4,This is a 5th sentence with special characters...,THIS IS A 5TH SENTENCE WITH SPECIAL CHARACTERS...,This is a 5th sentence with special characters,This is a th sentence with special characters ...,This is a 5th sentence with special characters...,51,9
5,This is 6th sentence with extra white space...,THIS IS 6TH SENTENCE WITH EXTRA WHITE SPACE...,This is 6th sentence with extra white space...,This is th sentence with extra white spaces...,This is 6th sentence with extra white spaces,53,8


### Combine all strings
We can combine all the strings in a series together into a single string with **`series.str.cat()`**.

In [16]:
text = df['text'].str.cat()
text

'This is a sample sentence.    This is another sentence with white spaces !  This is a third sentence with numbers 1234 and another numbers 343324.This is the 4th sentence, with punctuation?This is a 5th sentence with special characters #%^!   This is 6th sentence with extra white spaces      '

In [None]:
type(text)

str
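
In the result above the sentences run into each other because no separator was given. As a small sketch (the series `s` is hypothetical), `str.cat()` accepts a `sep` argument, and trimming each string first avoids carrying stray whitespace into the combined text:

```python
import pandas as pd

# Hypothetical sentences, used only for illustration
s = pd.Series(['first sentence.', '  second sentence.  ', 'third sentence.'])

# Trim each string, then join them with a single space between entries
joined = s.str.strip().str.cat(sep=' ')

print(joined)
```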