# 0.Pandas library course

Follow this link: https://youtu.be/vmEHCJofslg?si=cwhEaK8RUb-6PQ6J

# 1.What is pandas?


### Definition of Pandas

**Pandas** is a powerful, open-source data analysis and manipulation library for Python. It provides high-performance, easy-to-use data structures and data analysis tools. Pandas is designed for working with structured data, such as tables or spreadsheets, and is widely used in data science, machine learning, and data analysis projects. 

### Key Features of Pandas

1. **Data Structures**:
   - **Series**: A one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, etc.). Each element in a Series is associated with an index.
   - **DataFrame**: A two-dimensional labeled data structure with columns of potentially different data types. It can be thought of as a dictionary of Series objects, where each Series represents a column in the DataFrame.

2. **Data Alignment and Handling Missing Data**:
   - Pandas provides functions to detect, fill, or remove missing data.
   - Data is aligned on index labels, either automatically during operations or explicitly on request.

3. **Data Wrangling**:
   - **Merging and Joining**: Combine data from different sources, similar to SQL joins.
   - **Concatenation**: Append or concatenate data along different axes.
   - **Pivoting and Reshaping**: Transform data for analysis, including pivot tables and melting data.

4. **Data Transformation**:
   - **Group By**: Perform split-apply-combine operations on datasets, allowing for aggregation, transformation, and filtering.
   - **Vectorized Operations**: Perform operations on entire columns or datasets without writing explicit loops, which is more efficient and concise.

5. **Time Series Analysis**:
   - Tools for working with time series data, including date range generation, frequency conversion, moving window statistics, and more.

6. **Input and Output**:
   - Read and write data from/to various file formats, including CSV, Excel, SQL databases, JSON, HTML, and more.
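The data structure, alignment, and missing-data features listed above can be sketched with two small Series. The labels and values here are made up for illustration; the point is only that arithmetic matches rows by index label, not by position.

```python
import pandas as pd

# A Series is a one-dimensional labeled array; the index holds the labels.
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Arithmetic aligns on index labels; labels found in only one Series give NaN.
total = s1 + s2
print(total)

# Detect, fill, or drop the missing values produced by the alignment.
print(total.isna().sum())    # number of missing entries
filled = total.fillna(0)     # replace missing values with 0
dropped = total.dropna()     # remove rows with missing values
print(filled)
print(dropped)
```

Only the labels `b` and `c` exist in both Series, so only those two rows get a numeric result.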

### Example of Basic Operations in Pandas

#### Creating a DataFrame

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
        'age': [24, 27, 22, 32, 29],
        'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}
df = pd.DataFrame(data)
print(df)
```

#### Output:

```
      name  age         city
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4   Edward   29      Phoenix
```

#### Data Manipulation

```python
# Sorting by age
df_sorted = df.sort_values(by='age')
print(df_sorted)

# Adding a new column
df['country'] = 'USA'
print(df)

# Filtering rows
df_filtered = df[df['age'] > 25]
print(df_filtered)
```

#### Output:

```
      name  age         city
2  Charlie   22      Chicago
0    Alice   24     New York
1      Bob   27  Los Angeles
4   Edward   29      Phoenix
3    David   32      Houston
```

```
      name  age         city country
0    Alice   24     New York     USA
1      Bob   27  Los Angeles     USA
2  Charlie   22      Chicago     USA
3    David   32      Houston     USA
4   Edward   29      Phoenix     USA
```

```
     name  age         city country
1     Bob   27  Los Angeles     USA
3   David   32      Houston     USA
4  Edward   29      Phoenix     USA
```
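The group-by, merging, and vectorized operations mentioned in the feature list can be sketched on a similar table. The `state` column and the `states` lookup table are hypothetical additions made only for this example.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, 27, 22, 32],
    'state': ['NY', 'CA', 'NY', 'CA'],   # hypothetical column used for grouping
})

# Split-apply-combine: mean age per state.
mean_age = people.groupby('state')['age'].mean()
print(mean_age)

# A second, hypothetical table to join against.
states = pd.DataFrame({'state': ['NY', 'CA'],
                       'state_name': ['New York', 'California']})

# SQL-style left join on the shared 'state' column.
merged = people.merge(states, on='state', how='left')
print(merged)

# Vectorized operation: the addition applies to the whole column at once.
people['age_in_5_years'] = people['age'] + 5
print(people)
```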

### Use Cases of Pandas

- **Data Cleaning**: Handling missing values, duplicates, and transforming raw data into a usable format.
- **Exploratory Data Analysis (EDA)**: Summarizing the main characteristics of a dataset, often with the help of visualizations.
- **Data Visualization**: Integrating with libraries like Matplotlib and Seaborn for graphical representation of data.
- **Time Series Analysis**: Analyzing and manipulating time series data, such as stock prices or sensor readings.
- **Data Import and Export**: Reading data from and writing data to various file formats and databases.

### Conclusion

Pandas is an indispensable tool for anyone working with data in Python. Its powerful data structures, comprehensive set of functions for data manipulation, and ease of use make it a cornerstone of data analysis and preprocessing tasks. Whether you're cleaning data, performing complex transformations, or preparing data for machine learning models, pandas provides the functionality you need to streamline your workflow and make your data analysis more efficient.

# 2.How to read files using pandas?

This section shows how to read different types of files with pandas. It covers text files (`.txt`), CSV files (`.csv`), and Excel files (`.xlsx`).

## Lesson: Reading Different File Formats with Pandas

### 1. Reading Text Files (`.txt`)

Text files can contain data in various formats, such as plain text, tab-separated values, or space-separated values. Pandas reads delimited text files with `pd.read_csv()`; the `sep` parameter (a comma by default) sets the separator.

#### Example: Reading a Simple Text File

Assume you have a text file named `data.txt` with the following content:

```
name,age,city
Alice,24,New York
Bob,27,Los Angeles
Charlie,22,Chicago
David,32,Houston
Edward,29,Phoenix
```

#### Reading the Text File

```python
import pandas as pd

# Reading a comma-separated text file
df = pd.read_csv('data.txt')
print(df)
print(df.shape)  # Output: (5, 3)
```
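Text files are not always comma-separated. The sketch below reads tab-separated and space-separated text with the same function. To keep it self-contained, the file contents are hypothetical strings wrapped in `io.StringIO`; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Stand-in for a tab-separated file on disk (hypothetical content).
tsv_text = "name\tage\tcity\nAlice\t24\tNew York\nBob\t27\tLos Angeles\n"

# sep='\t' tells read_csv to split on tabs instead of commas.
df_tab = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(df_tab)
print(df_tab.shape)

# For columns separated by runs of spaces, a regular expression can be used.
space_text = "name age\nAlice   24\nBob     27\n"
df_space = pd.read_csv(io.StringIO(space_text), sep=r'\s+')
print(df_space)
```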

### 2. Reading CSV Files (`.csv`)

CSV (Comma-Separated Values) is one of the most common file formats for data exchange. Pandas makes it very easy to read and write CSV files.

#### Example: Reading a CSV File

Assume you have a CSV file named `data.csv` with the same content as above.

#### Reading the CSV File

```python
import pandas as pd

# Reading a CSV file
df = pd.read_csv('data.csv')
print(df)
print(df.shape)  # Output: (5, 3)
```

#### Customizing the CSV Reading

Pandas provides several parameters to customize the reading process:

- **Specifying a different delimiter** (e.g., semicolon):

```python
df = pd.read_csv('data.csv', delimiter=';')
```

- **Skipping rows** (here the first line of the file; the next line is then read as the header):

```python
df = pd.read_csv('data.csv', skiprows=1)
```

- **Reading only specific columns**:

```python
df = pd.read_csv('data.csv', usecols=['name', 'age'])
```
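These options can be combined in one call. The following is a minimal sketch that assumes a hypothetical semicolon-separated file with a title line above the header; the content is held in a string so the example runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical file content: a title line, then a semicolon-separated table.
raw = (
    "People report\n"
    "name;age;city\n"
    "Alice;24;New York\n"
    "Bob;27;Los Angeles\n"
)

df_custom = pd.read_csv(
    io.StringIO(raw),
    sep=';',                   # same effect as delimiter=';'
    skiprows=1,                # skip the title line so the header row is found
    usecols=['name', 'age'],   # keep only these columns
)
print(df_custom)
print(df_custom.shape)
```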

### 3. Reading Excel Files (`.xlsx`)

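Excel files are widely used for storing tabular data. Pandas provides functions to read Excel files, including files with multiple sheets. Reading `.xlsx` files requires an Excel engine such as the `openpyxl` package to be installed.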

#### Example: Reading an Excel File

Assume you have an Excel file named `data.xlsx` with a sheet named `Sheet1` containing the same content as above.

#### Reading the Excel File

```python
import pandas as pd

# Reading an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df)
print(df.shape)  # Output: (5, 3)
```

#### Customizing the Excel Reading

- **Reading multiple sheets**:

```python
# Reading all sheets into a dictionary of DataFrames
dfs = pd.read_excel('data.xlsx', sheet_name=None)
print(dfs.keys())  # Output: dict_keys(['Sheet1'])
print(dfs['Sheet1'])
```

- **Reading specific columns**:

```python
df = pd.read_excel('data.xlsx', usecols=['name', 'age'])
```

### Additional Customizations

Pandas allows you to handle more complex scenarios:

- **Specifying data types**:

```python
df = pd.read_csv('data.csv', dtype={'age': int})
```

- **Handling missing values**:

```python
df = pd.read_csv('data.csv', na_values=['NA', 'Missing'])
```

- **Parsing dates** (this assumes the file has a `date` column, which the sample data above does not):

```python
df = pd.read_csv('data.csv', parse_dates=['date'])
```
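To see these three options working together, here is a small sketch on in-memory data. The `joined` column and the placeholder values are assumptions made for the example. Note that a column with missing values cannot be read as plain `int`, so `float64` is used for `age`.

```python
import io
import pandas as pd

# Hypothetical content with placeholder strings for missing values.
raw = (
    "name,age,joined\n"
    "Alice,24,2023-01-15\n"
    "Bob,Missing,2023-02-20\n"
    "Charlie,22,NA\n"
)

df_typed = pd.read_csv(
    io.StringIO(raw),
    na_values=['NA', 'Missing'],   # treat these strings as missing
    dtype={'age': 'float64'},      # float, because NaN cannot be stored as int
    parse_dates=['joined'],        # convert this column to datetimes
)
print(df_typed)
print(df_typed.dtypes)
print(df_typed.isna().sum())
```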

### Summary

Pandas provides versatile and powerful methods to read various file formats. Here are the key takeaways for reading different files:

- **Text Files**: Use `pd.read_csv()` with appropriate delimiters.
- **CSV Files**: Use `pd.read_csv()`.
- **Excel Files**: Use `pd.read_excel()`.

These functions offer a range of parameters to customize the reading process, making pandas a flexible tool for data ingestion. Here's a final recap with example code for each type:

```python
import pandas as pd

# Reading a text file
df_txt = pd.read_csv('data.txt')
print(df_txt)
print(df_txt.shape)  # Output: (5, 3)

# Reading a CSV file
df_csv = pd.read_csv('data.csv')
print(df_csv)
print(df_csv.shape)  # Output: (5, 3)

# Reading an Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df_excel)
print(df_excel.shape)  # Output: (5, 3)
```

By using these methods, you can efficiently load data from various file formats into pandas DataFrames, ready for analysis and manipulation.

# 3.Accessing data frames using pandas



### 1. Accessing Columns

Columns in a pandas DataFrame can be accessed in multiple ways.

#### Accessing a Single Column

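You can access a single column using the column name as a key or as an attribute. Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.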

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
        'age': [24, 27, 22, 32, 29],
        'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}
df = pd.DataFrame(data)

# Accessing a column as a key
name_column = df['name']
print(name_column)
print(name_column.shape)  # Output: (5,)

# Accessing a column as an attribute
age_column = df.age
print(age_column)
print(age_column.shape)  # Output: (5,)
```

#### Accessing Multiple Columns

You can access multiple columns by passing a list of column names.

```python
subset = df[['name', 'age']]
print(subset)
print(subset.shape)  # Output: (5, 2)
```

### 2. Accessing Headers

The headers (column names) of a DataFrame can be accessed using the `.columns` attribute.

```python
headers = df.columns
print(headers)  # Output: Index(['name', 'age', 'city'], dtype='object')
```

### 3. Accessing Rows

Rows can be accessed using various methods such as `loc`, `iloc`, and slicing.

#### Accessing Rows by Index with `iloc`

`iloc` selects rows by integer position, starting at 0. Slices exclude the stop position.

```python
first_row = df.iloc[0]
print(first_row)
print(first_row.shape)  # Output: (3,)

# Accessing multiple rows
subset_rows = df.iloc[0:3]
print(subset_rows)
print(subset_rows.shape)  # Output: (3, 3)
```

#### Accessing Rows by Label with `loc`

`loc` is used for label-based indexing.

```python
# Setting custom index
df.set_index('name', inplace=True)

# Accessing a row by label
charlie_row = df.loc['Charlie']
print(charlie_row)
print(charlie_row.shape)  # Output: (2,)

# Accessing multiple rows by labels
subset_rows = df.loc[['Alice', 'David']]
print(subset_rows)
print(subset_rows.shape)  # Output: (2, 2)
```
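`loc` and `iloc` also accept a column selector as a second argument, and they treat slices differently. The sketch below rebuilds the same table so that it runs on its own; `df_demo` is just a name chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'age': [24, 27, 22, 32, 29],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
}).set_index('name')

# loc takes a row label and a column label together.
print(df_demo.loc['Bob', 'city'])

# Label slices include both endpoints: Alice, Bob, and Charlie.
by_label = df_demo.loc['Alice':'Charlie', ['age']]
print(by_label)

# Position slices exclude the stop position: rows 0 and 1 only.
by_position = df_demo.iloc[0:2, 0:1]
print(by_position)

# A boolean mask selects the rows where the condition is True.
older = df_demo.loc[df_demo['age'] > 25]
print(older)
```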

### 4. Accessing Individual Cells

Individual cells can be accessed using `at` (label-based) and `iat` (integer-location based).

#### Accessing Cells with `at`

```python
# Accessing a cell by row label and column name
cell_value = df.at['Charlie', 'age']
print(cell_value)  # Output: 22
```

#### Accessing Cells with `iat`

```python
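# Accessing a cell by row and column positions
# 'name' is now the index, so the remaining columns are age (0) and city (1)
cell_value = df.iat[2, 0]  # Row position 2 (Charlie), column position 0 (age)
print(cell_value)  # Output: 22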
```

### 5. Iterating Over DataFrames

Sometimes you may need to iterate over rows or columns.

#### Iterating Over Rows

You can iterate over rows using `iterrows()` or `itertuples()`.

```python
# Using iterrows()
for index, row in df.iterrows():
    print(index, row['age'])

# Using itertuples() for faster iteration
# 'name' is the index here, so it is exposed as row.Index
for row in df.itertuples():
    print(row.Index, row.age)
```

#### Iterating Over Columns

You can iterate over columns using the `items()` method.

```python
for column_name, column_data in df.items():
    print(f"Column: {column_name}")
    print(column_data)
```
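Iteration is often not needed for simple calculations, since pandas supports vectorized operations on whole columns. As a rough comparison, the sketch below computes the same result with a loop and with a single column expression; `df_demo` is a small table made up for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                        'age': [24, 27, 22]})

# Loop version: build the result one row at a time.
looped = []
for row in df_demo.itertuples(index=False):
    looped.append(row.age + 1)

# Vectorized version: one expression over the whole column.
vectorized = df_demo['age'] + 1

print(looped)
print(vectorized.tolist())
```

Both versions give the same values; the vectorized form is shorter and usually faster on large tables.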

### Summary

Pandas provides various methods to access different parts of a DataFrame, making it a versatile tool for data manipulation. Here are the key takeaways:

- **Accessing Columns**: Use `df['column_name']` or `df.column_name`.
- **Accessing Headers**: Use `df.columns`.
- **Accessing Rows**: Use `df.iloc[]` for position-based indexing and `df.loc[]` for label-based indexing.
- **Accessing Cells**: Use `df.at[label, column]` for label-based and `df.iat[row, column]` for position-based access.
- **Iterating**: Use `df.iterrows()`, `df.itertuples()`, and `df.items()` for iteration.

Here's a complete example that combines all these elements:

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
        'age': [24, 27, 22, 32, 29],
        'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}
df = pd.DataFrame(data)
print(df)
print(df.shape)  # Output: (5, 3)

# Accessing Columns
name_column = df['name']
print(name_column)
print(name_column.shape)  # Output: (5,)

subset = df[['name', 'age']]
print(subset)
print(subset.shape)  # Output: (5, 2)

# Accessing Headers
headers = df.columns
print(headers)  # Output: Index(['name', 'age', 'city'], dtype='object')

# Accessing Rows
first_row = df.iloc[0]
print(first_row)
print(first_row.shape)  # Output: (3,)

subset_rows = df.iloc[0:3]
print(subset_rows)
print(subset_rows.shape)  # Output: (3, 3)

df.set_index('name', inplace=True)
charlie_row = df.loc['Charlie']
print(charlie_row)
print(charlie_row.shape)  # Output: (2,)

subset_rows = df.loc[['Alice', 'David']]
print(subset_rows)
print(subset_rows.shape)  # Output: (2, 2)

# Accessing Individual Cells
cell_value = df.at['Charlie', 'age']
print(cell_value)  # Output: 22

# 'name' is now the index, so column position 0 is age and 1 is city
cell_value = df.iat[2, 0]  # Row position 2 (Charlie), column position 0 (age)
print(cell_value)  # Output: 22

# Iterating Over Rows
for index, row in df.iterrows():
    print(index, row['age'])

for row in df.itertuples():
    print(row.Index, row.age)

# Iterating Over Columns
for column_name, column_data in df.items():
    print(f"Column: {column_name}")
    print(column_data)
```

This lesson covers how to access various parts of a DataFrame, which is crucial for data manipulation and analysis in pandas.

# 4.Removing special characters, stop words, and lemmatization

Lemmatization, removing stop words, and removing special characters are key steps in preprocessing text data for natural language processing (NLP) and machine learning tasks. Each of these steps contributes to the quality and performance of your model by simplifying the text data and removing noise. Here's why these preprocessing steps are important:

### 1. Removing Special Characters

#### Why:
- **Noise Reduction**: Special characters (such as punctuation and symbols) often carry little information for tasks such as text classification or sentiment analysis. Removing them reduces noise in the data.
- **Consistency**: By removing special characters, you standardize the text data, which helps in better feature extraction and representation.

#### Example:
Consider the text: "Hello, World! This is a test."
- Before: "Hello, World! This is a test."
- After removing special characters: "Hello World This is a test"
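The example above can be reproduced with a regular expression. This is a minimal sketch using only the standard library; the function name is chosen for illustration, and the pattern keeps only ASCII letters and digits, which is an assumption that may not suit non-English text.

```python
import re

def remove_special_characters(text):
    # Replace every character that is not an ASCII letter, digit, or whitespace.
    cleaned = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    # Collapse the runs of whitespace left behind and trim the ends.
    return re.sub(r'\s+', ' ', cleaned).strip()

result = remove_special_characters("Hello, World! This is a test.")
print(result)  # Hello World This is a test
```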

### 2. Removing Stop Words

#### Why:
- **Irrelevance**: Stop words (like "the", "is", "in", "and") are common words that often do not carry significant meaning and are not useful for distinguishing between different pieces of text.
- **Dimensionality Reduction**: Removing stop words helps in reducing the size of the vocabulary, which in turn reduces the dimensionality of the feature space. This can lead to more efficient and faster model training.
- **Focus on Important Words**: By removing stop words, you retain the words that are more likely to carry significant meaning and information.

#### Example:
Consider the text: "This is a test of the text preprocessing step."
- Before: "This is a test of the text preprocessing step."
- After removing stop words: "test text preprocessing step"
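Stop-word removal is a simple membership test once the text is split into words. The sketch below uses a small hand-picked stop-word set, which is an assumption made to keep the example self-contained; real projects typically use a fuller list such as the one shipped with `nltk`.

```python
# Small illustrative stop-word set (hypothetical, far from complete).
STOP_WORDS = {'this', 'is', 'a', 'of', 'the', 'and', 'in'}

def remove_stop_words(text):
    # Lowercase so that "This" matches "this", then split on whitespace.
    words = text.lower().split()
    # Keep only the words that are not in the stop-word set.
    kept = [word for word in words if word not in STOP_WORDS]
    return ' '.join(kept)

result = remove_stop_words("This is a test of the text preprocessing step")
print(result)  # test text preprocessing step
```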

### 3. Lemmatization

#### Why:
- **Normalization**: Lemmatization reduces words to their base or root form (lemma), ensuring that different forms of a word (e.g., "running", "ran", "runs") are treated as the same word ("run"). This helps in normalizing the text data.
- **Improved Matching**: By reducing words to their lemmas, you improve the matching of words, leading to better feature extraction and more meaningful analysis.
- **Vocabulary Reduction**: Lemmatization helps in reducing the size of the vocabulary by consolidating different forms of a word into a single representation.

#### Example:
Consider the text: "The cats are running and the dogs are chasing."
- Before: "The cats are running and the dogs are chasing."
- After lemmatization (with part-of-speech information): "The cat be run and the dog be chase."

### Summary of Benefits

- **Cleaner Data**: Removing special characters and stop words results in cleaner, more relevant data for analysis.
- **Reduced Noise**: These preprocessing steps help in reducing noise, making it easier for the model to learn meaningful patterns.
- **Enhanced Performance**: A more consistent and reduced feature space typically leads to better model performance and faster training times.
- **Improved Accuracy**: By focusing on the meaningful parts of the text, your model is more likely to achieve better accuracy.

### Practical Implementation

Here’s how you can implement these steps in Python using the `nltk` library:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by word_tokenize in newer NLTK releases
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Remove special characters
    text = re.sub(r'\W', ' ', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stop words and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    
    return ' '.join(tokens)

# Example usage
text = "The cats are running and the dogs are chasing."
cleaned_text = preprocess_text(text)
print(cleaned_text)
```

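### Output

```
cat running dog chasing
```

"cats" and "dogs" are reduced to their singular forms, but "running" and "chasing" are unchanged. `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a part of speech is given; for example, `lemmatizer.lemmatize('running', pos='v')` returns `run`. The processed text is now ready for feature extraction and model training, with less noise and a more consistent vocabulary.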
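In a pandas workflow, a cleaning function is usually applied to a whole text column. The sketch below uses a simplified, standard-library-only `clean` function and a tiny stop-word set, both made up for this example; the `preprocess_text` function defined earlier could be passed to `apply` in the same way.

```python
import re
import pandas as pd

# Hypothetical table with a raw text column.
reviews = pd.DataFrame({'text': ["Hello, World!", "The cats are running."]})

# Tiny illustrative stop-word set (an assumption, not a complete list).
STOP_WORDS = {'the', 'are', 'is', 'a'}

def clean(text):
    # Lowercase, replace special characters with spaces, and split into words.
    words = re.sub(r'[^a-z0-9\s]', ' ', text.lower()).split()
    # Drop stop words and join the remaining words back together.
    return ' '.join(word for word in words if word not in STOP_WORDS)

# apply() calls the function once per value in the column.
reviews['clean_text'] = reviews['text'].apply(clean)
print(reviews)
```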