
Pandas is a powerful and versatile library in Python used for data manipulation and analysis. Here are some of the main functions and methods in Pandas, organized by their typical usage:

### 1. **Creating Data Structures**
- **`pd.Series(data, index)`**: Create a one-dimensional array-like object.
- **`pd.DataFrame(data, index, columns)`**: Create a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.

### 2. **Reading and Writing Data**
- **`pd.read_csv(filepath)`**: Read a CSV file into a DataFrame.
- **`pd.read_excel(filepath, sheet_name)`**: Read an Excel file into a DataFrame.
- **`pd.read_sql(query, connection)`**: Read SQL query or database table into a DataFrame.
- **`pd.read_json(path_or_buf)`**: Read JSON data (a file path, URL, or file-like object) into a DataFrame.
- **`pd.read_html(url)`**: Read HTML tables into a list of DataFrames.
- **`df.to_csv(filepath)`**: Write DataFrame to a CSV file.
- **`df.to_excel(filepath)`**: Write DataFrame to an Excel file.
- **`df.to_json(path_or_buf)`**: Write DataFrame to a JSON string or file.

### 3. **Viewing and Inspecting Data**
- **`df.head(n)`**: Return the first `n` rows.
- **`df.tail(n)`**: Return the last `n` rows.
- **`df.info()`**: Print a concise summary of a DataFrame.
- **`df.describe()`**: Generate descriptive statistics.
- **`df.shape`**: Get the dimensions of the DataFrame (rows, columns).
- **`df.columns`**: Get or set the column labels.

### 4. **Selection and Filtering**
- **`df.loc[]`**: Access a group of rows and columns by labels or a boolean array.
- **`df.iloc[]`**: Access a group of rows and columns by integer position.
- **`df.at[]`**: Access a single value for a row/column label pair.
- **`df.iat[]`**: Access a single value for a row/column pair by integer position.
- **`df[df['column'] > value]`**: Filter rows based on column values.

### 5. **Data Manipulation**
- **`df.assign(**kwargs)`**: Assign new columns to a DataFrame.
- **`df.drop(labels, axis)`**: Drop specified labels from rows or columns.
- **`df.rename(columns={'old_name': 'new_name'})`**: Rename columns.
- **`df.sort_values(by, ascending=True)`**: Sort by the values along either axis.
- **`df.sort_index(axis=0, ascending=True)`**: Sort by the index (row labels).
- **`df.apply(func)`**: Apply a function along an axis of the DataFrame.
- **`df.groupby(by)`**: Group DataFrame using a mapper or by a Series of columns.
- **`df.merge(right, on, how)`**: Merge DataFrame or named Series objects with a database-style join.

### 6. **Handling Missing Data**
- **`df.isnull()`**: Detect missing values.
- **`df.notnull()`**: Detect existing (non-missing) values.
- **`df.dropna(axis)`**: Remove missing values.
- **`df.fillna(value)`**: Fill missing values.

### 7. **Aggregating Data**
- **`df.sum(axis)`**: Return the sum of the values for the requested axis.
- **`df.mean(axis)`**: Return the mean of the values for the requested axis.
- **`df.median(axis)`**: Return the median of the values for the requested axis.
- **`df.std(axis)`**: Return the standard deviation of the values for the requested axis.
- **`df.min(axis)`**: Return the minimum of the values for the requested axis.
- **`df.max(axis)`**: Return the maximum of the values for the requested axis.

### 8. **Time Series**
- **`pd.to_datetime(arg)`**: Convert argument to datetime.
- **`pd.date_range(start, end, periods)`**: Generate a fixed frequency DatetimeIndex.

### 9. **Pivoting and Reshaping Data**
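- **`df.pivot(index, columns, values)`**: Reshape a DataFrame by index and column values, without aggregation (duplicate index/column pairs raise an error).
- **`df.pivot_table(values, index, columns, aggfunc)`**: Create a spreadsheet-style pivot table, aggregating duplicate entries with `aggfunc`.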
- **`df.melt(id_vars, value_vars)`**: Unpivot a DataFrame from wide format to long format.
- **`df.stack()`**: Pivot the columns of the DataFrame into the index.
- **`df.unstack()`**: Pivot the index of the DataFrame into the columns.

### 10. **Miscellaneous**
- **`pd.concat(objs, axis)`**: Concatenate pandas objects along a particular axis.
- **`pd.merge(left, right, how, on)`**: Merge DataFrame or named Series objects with a database-style join.
- **`pd.crosstab(index, columns)`**: Compute a simple cross-tabulation of two (or more) factors.

These functions provide a broad overview of the capabilities of Pandas for data manipulation and analysis.

Creating data structures is one of the fundamental tasks in Pandas. Here are detailed explanations and examples for creating `Series` and `DataFrame` objects:

### 1. **Creating a Pandas Series**

A `Series` is a one-dimensional array-like object that can hold various types of data. It is similar to a column in a DataFrame.

#### `pd.Series(data, index)`

- **`data`**: Can be a list, dictionary, or NumPy array.
- **`index`**: An optional list of labels for the data.

**Example 1: Creating a Series from a List**
```python
import pandas as pd

# Create a Series from a list
data = [10, 20, 30, 40]
index = ['a', 'b', 'c', 'd']
series = pd.Series(data, index=index)

print(series)
```
Output:
```
a    10
b    20
c    30
d    40
dtype: int64
```

**Example 2: Creating a Series from a Dictionary**
```python
# Create a Series from a dictionary
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
series = pd.Series(data)

print(series)
```
Output:
```
a    10
b    20
c    30
d    40
dtype: int64
```
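
**Example 3: Creating a Series from a NumPy Array**

The parameter list above mentions NumPy arrays, so here is a minimal sketch of that case. The values, labels, and the `name` argument are made up for illustration.

```python
import numpy as np
import pandas as pd

# Build a Series from a NumPy array; the array's float dtype carries over
values = np.array([1.5, 2.5, 3.5])
series_np = pd.Series(values, index=['x', 'y', 'z'], name='measurement')

print(series_np)

# Label-based and position-based access both work on a Series
print(series_np['y'])     # 2.5
print(series_np.iloc[0])  # 1.5
```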

### 2. **Creating a Pandas DataFrame**

A `DataFrame` is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

#### `pd.DataFrame(data, index, columns)`

- **`data`**: Can be a dictionary of arrays, lists, or another DataFrame.
- **`index`**: An optional list of row labels.
- **`columns`**: An optional list of column labels.

**Example 1: Creating a DataFrame from a Dictionary of Lists**
```python
# Create a DataFrame from a dictionary of lists
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
index = ['row1', 'row2', 'row3', 'row4']
df = pd.DataFrame(data, index=index)

print(df)
```
Output:
```
       A  B   C
row1   1  5   9
row2   2  6  10
row3   3  7  11
row4   4  8  12
```

**Example 2: Creating a DataFrame from a List of Dictionaries**
```python
# Create a DataFrame from a list of dictionaries
data = [
    {'A': 1, 'B': 5, 'C': 9},
    {'A': 2, 'B': 6, 'C': 10},
    {'A': 3, 'B': 7, 'C': 11},
    {'A': 4, 'B': 8, 'C': 12}
]
df = pd.DataFrame(data)

print(df)
```
Output:
```
   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
```

**Example 3: Creating a DataFrame from a Dictionary of Series**
```python
# Create a Series for each column
series_A = pd.Series([1, 2, 3, 4], index=['row1', 'row2', 'row3', 'row4'])
series_B = pd.Series([5, 6, 7, 8], index=['row1', 'row2', 'row3', 'row4'])
series_C = pd.Series([9, 10, 11, 12], index=['row1', 'row2', 'row3', 'row4'])

# Create a DataFrame from a dictionary of Series
data = {
    'A': series_A,
    'B': series_B,
    'C': series_C
}
df = pd.DataFrame(data)

print(df)
```
Output:
```
       A  B   C
row1   1  5   9
row2   2  6  10
row3   3  7  11
row4   4  8  12
```

These examples demonstrate the flexibility and ease of creating Pandas `Series` and `DataFrame` objects from various types of data sources.


Reading and writing data are essential tasks in data analysis, and Pandas provides a variety of functions to handle these operations. Here are detailed examples of each function:

### Reading Data

#### 1. **Reading a CSV File**
**`pd.read_csv(filepath)`**: Read a CSV file into a DataFrame.

Example:
```python
import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('data.csv')

print(df)
```
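
The example above assumes a file named `data.csv` exists. As a sketch that runs without any file, the snippet below reads made-up CSV text from memory and shows a few commonly used `read_csv` options (`sep`, `usecols`, `dtype`). The column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so the example needs no file on disk
csv_text = """name;age;city
Alice;25;New York
Bob;30;Los Angeles
"""

df_people = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                   # field delimiter (the default is ',')
    usecols=['name', 'age'],   # load only these columns
    dtype={'age': 'int64'},    # force a specific column dtype
)

print(df_people)
```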

#### 2. **Reading an Excel File**
**`pd.read_excel(filepath, sheet_name)`**: Read an Excel file into a DataFrame.

Example:
```python
# Read an Excel file into a DataFrame
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

print(df)
```

#### 3. **Reading from a SQL Query or Database Table**
**`pd.read_sql(query, connection)`**: Read SQL query or database table into a DataFrame.

Example:
```python
import pandas as pd
import sqlite3

# Create a connection to the SQLite database
conn = sqlite3.connect('database.db')

# Read SQL query into a DataFrame
df = pd.read_sql('SELECT * FROM table_name', conn)

print(df)

# Close the connection
conn.close()
```
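
The snippet above depends on an existing `database.db`. A self-contained variant, assuming an in-memory SQLite database and a made-up `people` table, might look like this. It also passes the filter value through `params` rather than formatting it into the SQL string.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database; the 'people' table is invented for this example
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
conn.executemany(
    'INSERT INTO people VALUES (?, ?)',
    [('Alice', 25), ('Bob', 30), ('Charlie', 35)],
)
conn.commit()

# Parameterized query: the ? placeholder is filled from params
df_sql = pd.read_sql('SELECT name, age FROM people WHERE age > ?', conn, params=(28,))
conn.close()

print(df_sql)
```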

#### 4. **Reading JSON Data**
**`pd.read_json(path_or_buf)`**: Read JSON data from a file path, URL, or file-like object into a DataFrame.

Example:
```python
# Read JSON data into a DataFrame
df = pd.read_json('data.json')

print(df)
```

#### 5. **Reading HTML Tables**
**`pd.read_html(url)`**: Read HTML tables into a list of DataFrames.

Example:
```python
# Read HTML tables into a list of DataFrames
dfs = pd.read_html('https://example.com/tables.html')

# Print the first table
print(dfs[0])
```

### Writing Data

#### 1. **Writing to a CSV File**
**`df.to_csv(filepath)`**: Write DataFrame to a CSV file.

Example:
```python
# Write DataFrame to a CSV file
df.to_csv('output.csv', index=False)
```

#### 2. **Writing to an Excel File**
**`df.to_excel(filepath)`**: Write DataFrame to an Excel file.

Example:
```python
# Write DataFrame to an Excel file
df.to_excel('output.xlsx', index=False, sheet_name='Sheet1')
```

#### 3. **Writing to a JSON String**
**`df.to_json(path_or_buf)`**: Write DataFrame to a JSON string or file.

Example:
```python
# Write DataFrame to a JSON string
json_string = df.to_json()

# Write DataFrame to a JSON file
df.to_json('output.json')
```
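
To see what the JSON output looks like, here is a small round-trip sketch using `orient='records'`. The DataFrame is made up, and the string is wrapped in `io.StringIO` when reading it back because recent pandas versions discourage passing literal JSON strings to `read_json`.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

# orient='records' produces a JSON array with one object per row
json_records = df_small.to_json(orient='records')
print(json_records)  # [{"A":1,"B":"x"},{"A":2,"B":"y"}]

# Read the JSON text back into a DataFrame
df_back = pd.read_json(io.StringIO(json_records), orient='records')
print(df_back)
```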

These functions enable you to efficiently read data from various sources and write data to different formats, facilitating the data manipulation and analysis process in Pandas.


Viewing and inspecting data are crucial steps in data analysis to understand the structure and contents of a DataFrame. Here are examples of key Pandas functions and attributes for viewing and inspecting data:

### 1. **Viewing the First Few Rows**
#### `df.head(n)`
Returns the first `n` rows of the DataFrame. If `n` is not provided, it returns the first 5 rows by default.
```python
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# View the first 3 rows
print(df.head(3))
```

Output:
```
   A   B
0  1  10
1  2  20
2  3  30
```

### 2. **Viewing the Last Few Rows**
#### `df.tail(n)`
Returns the last `n` rows of the DataFrame. If `n` is not provided, it returns the last 5 rows by default.
```python
# View the last 2 rows
print(df.tail(2))
```

Output:
```
   A   B
3  4  40
4  5  50
```

### 3. **Printing a Concise Summary of the DataFrame**
#### `df.info()`
Prints a concise summary of the DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.
```python
# Print a concise summary of the DataFrame
df.info()
```

Output:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int64
 1   B       5 non-null      int64
dtypes: int64(2)
memory usage: 208.0 bytes
```

### 4. **Generating Descriptive Statistics**
#### `df.describe()`
Generates descriptive statistics of the DataFrame, including count, mean, std (standard deviation), min, 25%, 50%, 75%, and max for numerical columns.
```python
# Generate descriptive statistics
print(df.describe())
```

Output:
```
              A          B
count  5.000000   5.000000
mean   3.000000  30.000000
std    1.581139  15.811388
min    1.000000  10.000000
25%    2.000000  20.000000
50%    3.000000  30.000000
75%    4.000000  40.000000
max    5.000000  50.000000
```

### 5. **Getting the Dimensions of the DataFrame**
#### `df.shape`
Returns a tuple representing the dimensionality of the DataFrame (number of rows, number of columns).
```python
# Get the dimensions of the DataFrame
print(df.shape)
```

Output:
```
(5, 2)
```

### 6. **Getting or Setting the Column Labels**
#### `df.columns`
Returns an Index object containing the column labels of the DataFrame. You can also use it to set new column labels.
```python
# Get the column labels
print(df.columns)

# Set new column labels
df.columns = ['Column1', 'Column2']
print(df.columns)
```

Output:
```
Index(['A', 'B'], dtype='object')

Index(['Column1', 'Column2'], dtype='object')
```

These functions and attributes are useful for quickly examining the contents and structure of a DataFrame, which helps in understanding the data and planning subsequent analysis steps.


Selection and filtering in Pandas allow you to access and manipulate specific parts of your DataFrame efficiently. The main indexers and filtering patterns are explained below with examples:

### 1. **Accessing Data by Labels with `loc`**

#### `df.loc[]`
Access a group of rows and columns by labels or a boolean array. It can be used for label-based indexing.

**Example 1: Selecting Rows and Columns by Labels**
```python
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C': [100, 200, 300, 400]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4'])

# Select a specific row by label
print(df.loc['row2'])

# Select specific rows and columns by labels
print(df.loc[['row2', 'row4'], ['A', 'C']])
```

Output:
```
A      2
B     20
C    200
Name: row2, dtype: int64

       A    C
row2   2  200
row4   4  400
```

**Example 2: Using Boolean Array**
```python
# Select rows where column 'A' is greater than 2
print(df.loc[df['A'] > 2])
```

Output:
```
       A   B    C
row3   3  30  300
row4   4  40  400
```

### 2. **Accessing Data by Integer Position with `iloc`**

#### `df.iloc[]`
Access a group of rows and columns by integer position (location-based indexing).

**Example 1: Selecting Rows and Columns by Integer Position**
```python
# Select a specific row by integer position
print(df.iloc[1])

# Select specific rows and columns by integer positions
print(df.iloc[[1, 3], [0, 2]])
```

Output:
```
A      2
B     20
C    200
Name: row2, dtype: int64

       A    C
row2   2  200
row4   4  400
```
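
One difference between `loc` and `iloc` that is easy to miss is how slices treat the end point. The sketch below uses a small made-up DataFrame to illustrate it.

```python
import pandas as pd

df_slice = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
    index=['row1', 'row2', 'row3', 'row4'],
)

# loc slices by label and INCLUDES the end label
by_label = df_slice.loc['row1':'row3']

# iloc slices by position and EXCLUDES the end position, like Python lists
by_position = df_slice.iloc[0:2]

print(by_label.index.tolist())     # ['row1', 'row2', 'row3']
print(by_position.index.tolist())  # ['row1', 'row2']
```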

### 3. **Accessing Single Values with `at` and `iat`**

#### `df.at[]`
Access a single value for a row/column label pair.

**Example: Accessing a Single Value by Label Pair**
```python
# Access a single value using labels
print(df.at['row2', 'B'])
```

Output:
```
20
```

#### `df.iat[]`
Access a single value for a row/column pair by integer position.

**Example: Accessing a Single Value by Integer Position**
```python
# Access a single value using integer positions
print(df.iat[1, 1])
```

Output:
```
20
```

### 4. **Filtering Rows Based on Column Values**

**Example: Filtering Rows Where Column 'B' Is Greater Than a Value**
```python
# Filter rows where column 'B' is greater than 20
filtered_df = df[df['B'] > 20]

print(filtered_df)
```

Output:
```
       A   B    C
row3   3  30  300
row4   4  40  400
```
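
Filters are often combined. As a sketch on a made-up DataFrame, the snippet below joins two conditions with `&`, selects by membership with `isin()`, and negates a mask with `~`.

```python
import pandas as pd

df_f = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C': [100, 200, 300, 400]},
    index=['row1', 'row2', 'row3', 'row4'],
)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
both = df_f[(df_f['A'] > 1) & (df_f['B'] < 40)]

# isin() keeps rows whose value appears in the given collection
in_set = df_f[df_f['C'].isin([100, 400])]

# ~ inverts a boolean mask
not_in_set = df_f[~df_f['C'].isin([100, 400])]

print(both.index.tolist())        # ['row2', 'row3']
print(in_set.index.tolist())      # ['row1', 'row4']
print(not_in_set.index.tolist())  # ['row2', 'row3']
```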

These selection and filtering methods allow for flexible and efficient data access and manipulation, enabling you to work effectively with your DataFrame in Pandas.


Data manipulation is a fundamental part of data analysis in Pandas. Below are examples of the most common methods for manipulating data within a DataFrame.

### 1. **Assigning New Columns**

#### `df.assign(**kwargs)`
Creates new columns or modifies existing ones by assigning values to them.

**Example: Assigning a New Column**
```python
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Assign a new column 'C'
df = df.assign(C=df['A'] + df['B'])

print(df)
```

Output:
```
   A  B  C
0  1  4  5
1  2  5  7
2  3  6  9
```

### 2. **Dropping Rows or Columns**

#### `df.drop(labels, axis)`
Drops specified labels from rows or columns.

**Example: Dropping a Column**
```python
# Drop column 'B'
df = df.drop('B', axis=1)

print(df)
```

Output:
```
   A  C
0  1  5
1  2  7
2  3  9
```

**Example: Dropping a Row**
```python
# Drop row with index 0
df = df.drop(0, axis=0)

print(df)
```

Output:
```
   A  C
1  2  7
2  3  9
```

### 3. **Renaming Columns**

#### `df.rename(columns={'old_name': 'new_name'})`
Renames columns in the DataFrame.

**Example: Renaming a Column**
```python
# Rename column 'A' to 'Alpha'
df = df.rename(columns={'A': 'Alpha'})

print(df)
```

Output:
```
   Alpha  C
1      2  7
2      3  9
```

### 4. **Sorting by Values**

#### `df.sort_values(by, ascending=True)`
Sorts the DataFrame by the values along either axis.

**Example: Sorting by Column Values**
```python
# Sort by column 'C' in descending order
df = df.sort_values(by='C', ascending=False)

print(df)
```

Output:
```
   Alpha  C
2      3  9
1      2  7
```

### 5. **Sorting by Index**

#### `df.sort_index(axis=0, ascending=True)`
Sorts the DataFrame by the index.

**Example: Sorting by Index**
```python
# Sort by index in ascending order
df = df.sort_index(axis=0, ascending=True)

print(df)
```

Output:
```
   Alpha  C
1      2  7
2      3  9
```

### 6. **Applying a Function Along an Axis**

#### `df.apply(func)`
Applies a function along an axis of the DataFrame.

**Example: Applying a Function to Columns**
```python
# Define a function to multiply by 2
def multiply_by_2(x):
    return x * 2

# Apply the function to each column
df = df.apply(multiply_by_2)

print(df)
```

Output:
```
   Alpha   C
1      4  14
2      6  18
```
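
The description mentions applying a function "along an axis". A short sketch of both directions, on a made-up DataFrame, could look like this:

```python
import pandas as pd

df_apply = pd.DataFrame({'Alpha': [4, 6], 'C': [14, 18]}, index=[1, 2])

# axis=0 (the default): the function receives each column as a Series
col_max = df_apply.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_sum = df_apply.apply(lambda row: row['Alpha'] + row['C'], axis=1)

print(col_max.tolist())  # [6, 18]
print(row_sum.tolist())  # [18, 24]
```

For simple arithmetic like this, vectorized expressions such as `df_apply['Alpha'] + df_apply['C']` are usually faster than `apply` with `axis=1`.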

### 7. **Grouping Data**

#### `df.groupby(by)`
Groups the DataFrame using a mapper or by a Series of columns.

**Example: Grouping and Calculating Sum**
```python
# Sample DataFrame
data = {'Category': ['A', 'B', 'A', 'B'], 'Values': [10, 20, 30, 40]}
df = pd.DataFrame(data)

# Group by 'Category' and calculate sum
grouped = df.groupby('Category').sum()

print(grouped)
```

Output:
```
          Values
Category        
A             40
B             60
```
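
A group can be summarized with several functions at once. The sketch below, on a made-up DataFrame, shows `agg` with a list of function names and with named aggregation; the output column names `total` and `largest` are chosen for illustration.

```python
import pandas as pd

df_g = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'], 'Values': [10, 20, 30, 40]})

# Several aggregations on one column: one result column per function
summary = df_g.groupby('Category')['Values'].agg(['sum', 'mean', 'count'])
print(summary)

# Named aggregation: new_column=(source_column, function)
named = df_g.groupby('Category').agg(
    total=('Values', 'sum'),
    largest=('Values', 'max'),
)
print(named)
```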

### 8. **Merging DataFrames**

#### `df.merge(right, on, how)`
Merges two DataFrames using a database-style join.

**Example: Merging Two DataFrames**
```python
# Sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})

# Merge DataFrames on 'key' column
merged_df = df1.merge(df2, on='key', how='inner')

print(merged_df)
```

Output:
```
  key  value1  value2
0   A       1       4
1   B       2       5
```
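
The `how` parameter decides which keys survive the join. As a sketch using the same two sample frames, a left join keeps every row of `df1`, and an outer join keeps keys from both sides; `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})

# Left join: all keys from df1; value2 is NaN where df2 has no match ('C')
left_joined = df1.merge(df2, on='key', how='left')
print(left_joined)

# Outer join: keys from both frames, with the origin of each row in '_merge'
outer_joined = df1.merge(df2, on='key', how='outer', indicator=True)
print(outer_joined)
```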

These methods and functions are powerful tools for data manipulation in Pandas, allowing you to reshape and analyze your data effectively.

```python
import pandas as pd

# Read the sample CSV file into a DataFrame
df = pd.read_csv('sample_data/california_housing_test.csv')

# Print the DataFrame
print(df)
```

Output (truncated):
```
      longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0       -122.05     37.37                27.0       3885.0           661.0   
1       -118.30     34.26                43.0       1510.0           310.0   
2       -117.81     33.78                27.0       3589.0           507.0   
3       -118.36     33.82                28.0         67.0            15.0   
4       -119.67     36.33                19.0       1241.0           244.0   
...         ...       ...                 ...          ...             ...   
2995    -119.86     34.42                23.0       1450.0           642.0   
2996    -118.14     34.06                27.0       5257.0          1082.0   
2997    -119.70     36.30                10.0        956.0           201.0   
2998    -117.12     34.10                40.0         96.0            14.0   
2999    -119.63     34.42                42.0       1765.0           263.0   

      population  households  median_income  median_house_value
```

```python
df.describe()
```

Output:

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 |
| mean | -119.5892 | 35.63539 | 28.845333 | 2599.578667 | 529.950667 | 1402.798667 | 489.912 | 3.807272 | 205846.275 |
| std | 1.994936 | 2.12967 | 12.555396 | 2155.593332 | 415.654368 | 1030.543012 | 365.42271 | 1.854512 | 113119.68747 |
| min | -124.18 | 32.56 | 1.0 | 6.0 | 2.0 | 5.0 | 2.0 | 0.4999 | 22500.0 |
| 25% | -121.81 | 33.93 | 18.0 | 1401.0 | 291.0 | 780.0 | 273.0 | 2.544 | 121200.0 |
| 50% | -118.485 | 34.27 | 29.0 | 2106.0 | 437.0 | 1155.0 | 409.5 | 3.48715 | 177650.0 |
| 75% | -118.02 | 37.69 | 37.0 | 3129.0 | 636.0 | 1742.75 | 597.25 | 4.656475 | 263975.0 |
| max | -114.49 | 41.92 | 52.0 | 30450.0 | 5419.0 | 11935.0 | 4930.0 | 15.0001 | 500001.0 |


```python
df.drop_duplicates()
```

Output (truncated):

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.05 | 37.37 | 27.0 | 3885.0 | 661.0 | 1537.0 | 606.0 | 6.6085 | 344700.0 |
| 1 | -118.30 | 34.26 | 43.0 | 1510.0 | 310.0 | 809.0 | 277.0 | 3.5990 | 176500.0 |
| 2 | -117.81 | 33.78 | 27.0 | 3589.0 | 507.0 | 1484.0 | 495.0 | 5.7934 | 270500.0 |
| 3 | -118.36 | 33.82 | 28.0 | 67.0 | 15.0 | 49.0 | 11.0 | 6.1359 | 330000.0 |
| 4 | -119.67 | 36.33 | 19.0 | 1241.0 | 244.0 | 850.0 | 237.0 | 2.9375 | 81700.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2995 | -119.86 | 34.42 | 23.0 | 1450.0 | 642.0 | 1258.0 | 607.0 | 1.1790 | 225000.0 |
| 2996 | -118.14 | 34.06 | 27.0 | 5257.0 | 1082.0 | 3496.0 | 1036.0 | 3.3906 | 237200.0 |
| 2997 | -119.70 | 36.30 | 10.0 | 956.0 | 201.0 | 693.0 | 220.0 | 2.2895 | 62000.0 |
| 2998 | -117.12 | 34.10 | 40.0 | 96.0 | 14.0 | 46.0 | 14.0 | 3.2708 | 162500.0 |


Dropping duplicates in a Pandas DataFrame can be done using the `drop_duplicates()` method. This method can remove duplicate rows based on specific columns or all columns. Here are some examples of how to use it:

### 1. **Drop Duplicate Rows Based on All Columns**
To drop rows that are completely duplicate across all columns:
```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['a', 'b', 'b', 'c', 'd', 'd']
})

# Drop duplicate rows
df = df.drop_duplicates()

print(df)
```

### 2. **Drop Duplicate Rows Based on Specific Columns**
To drop rows that have duplicate values in specific columns:
```python
# Drop duplicates based on column 'A'
df = df.drop_duplicates(subset=['A'])

print(df)
```

### 3. **Keep the First or Last Occurrence**
By default, `drop_duplicates()` keeps the first occurrence of each set of duplicates. You can change this behavior by using the `keep` parameter:
- `keep='first'` (default): Keep the first occurrence.
- `keep='last'`: Keep the last occurrence.
- `keep=False`: Drop all duplicates.

Example:
```python
# Keep the last occurrence of duplicates
df = df.drop_duplicates(keep='last')

print(df)

# Drop all duplicates
df = df.drop_duplicates(keep=False)

print(df)
```
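
Because the snippets above reassign `df` step by step, the effect of each `keep` option can be hard to see. The sketch below applies each option to a fresh copy of the sample data (named `df_dup` here for illustration) and uses `duplicated()` to show which rows are flagged.

```python
import pandas as pd

df_dup = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['a', 'b', 'b', 'c', 'd', 'd'],
})

# duplicated() marks every repeat of an earlier row as True
mask = df_dup.duplicated()
print(mask.tolist())  # [False, False, True, False, False, True]

# Compare which original row labels survive under each keep option
first = df_dup.drop_duplicates(keep='first')
last = df_dup.drop_duplicates(keep='last')
none = df_dup.drop_duplicates(keep=False)

print(first.index.tolist())  # [0, 1, 3, 4]
print(last.index.tolist())   # [0, 2, 3, 5]
print(none.index.tolist())   # [0, 3]
```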

### 4. **In-Place Modification**
To modify the original DataFrame without creating a new one, use the `inplace` parameter:
```python
# Drop duplicates in place
df.drop_duplicates(inplace=True)

print(df)
```

### 5. **Consideration of Index**
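The `drop_duplicates()` method ignores the DataFrame index when identifying duplicates. To include the index in the comparison, move it into a regular column with `reset_index()` first:
```python
# Move the index into a regular column (named 'index' by default)
# so that it takes part in the duplicate comparison
df_with_index = df.reset_index()

# Drop duplicates, now comparing the former index values as well
df_with_index = df_with_index.drop_duplicates()

print(df_with_index)
```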

These examples cover the basic usage of `drop_duplicates()` in Pandas for removing duplicate rows in a DataFrame.

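```python
# In a notebook cell only the last expression is displayed,
# so the output below is the per-column count of non-missing values
df.isnull().sum()
df.notnull().sum()
```

Output:
```
longitude             3000
latitude              3000
housing_median_age    3000
total_rooms           3000
total_bedrooms        3000
population            3000
households            3000
median_income         3000
median_house_value    3000
dtype: int64
```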

Handling missing data is a crucial part of data cleaning and preprocessing in Pandas. Here are detailed examples of the main methods for detecting, removing, and filling missing values:

### 1. **Detect Missing Values**

#### `df.isnull()`
This method returns a DataFrame of the same shape as the original, but with boolean values indicating whether an entry is `NaN` (missing).
```python
import pandas as pd
import numpy as np

# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [1, np.nan, 3, 4]
})

# Detect missing values
missing_values = df.isnull()

print(missing_values)
```

#### `df.notnull()`
This method returns a DataFrame of the same shape as the original, but with boolean values indicating whether an entry is not `NaN` (not missing).
```python
# Detect non-missing values
non_missing_values = df.notnull()

print(non_missing_values)
```

### 2. **Remove Missing Values**

#### `df.dropna(axis=0)`
This method removes rows (default) or columns (if `axis=1`) that contain missing values.
```python
# Drop rows with any missing values
df_dropped_rows = df.dropna()

print(df_dropped_rows)

# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)

print(df_dropped_columns)
```

You can also drop rows or columns only if all values are missing by setting the `how` parameter:
```python
# Drop rows only if all values are missing
df_dropped_all_rows = df.dropna(how='all')

print(df_dropped_all_rows)

# Drop columns only if all values are missing
df_dropped_all_columns = df.dropna(axis=1, how='all')

print(df_dropped_all_columns)
```

Additionally, you can specify a subset of columns to consider when dropping:
```python
# Drop rows if any of the specified columns have missing values
df_dropped_subset = df.dropna(subset=['A', 'B'])

print(df_dropped_subset)
```
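
Between "any" and "all" there is a middle ground: the `thresh` parameter keeps a row only if it has at least that many non-missing values. A minimal sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df_t = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, np.nan, 4],
    'C': [1, np.nan, np.nan, 4],
})

# Keep rows that have at least 2 non-missing values
kept = df_t.dropna(thresh=2)

print(kept.index.tolist())  # [0, 1, 3]
```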

### 3. **Fill Missing Values**

#### `df.fillna(value)`
This method fills in missing values with a specified value. The `value` parameter can be a scalar, dictionary, or DataFrame.
```python
# Fill missing values with a specific value
df_filled_value = df.fillna(0)

print(df_filled_value)

# Fill missing values with different values for each column
df_filled_dict = df.fillna({'A': 0, 'B': 1, 'C': 2})

print(df_filled_dict)
```

You can also propagate neighboring values with forward fill or backward fill. Recent pandas versions deprecate `fillna(method=...)`, so the dedicated `ffill()` and `bfill()` methods are used here:
```python
# Forward fill (fill with the previous value)
df_filled_ffill = df.ffill()

print(df_filled_ffill)

# Backward fill (fill with the next value)
df_filled_bfill = df.bfill()

print(df_filled_bfill)
```
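
A constant is not always a sensible replacement. As a sketch of two common alternatives on a made-up DataFrame, the snippet below fills each column with its own mean and, separately, fills a gap by linear interpolation.

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
})

# df_m.mean() is a Series indexed by column name, so each column
# is filled with its own mean (missing values are skipped when averaging)
filled_mean = df_m.fillna(df_m.mean())
print(filled_mean)

# Linear interpolation estimates a missing value from its neighbors
interpolated = df_m['A'].interpolate()
print(interpolated.tolist())  # [1.0, 2.0, 3.0, 4.0]
```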

These examples illustrate how to detect, remove, and fill missing values in a Pandas DataFrame using the provided methods. Adjust the parameters to fit your specific data cleaning needs.

The following short code examples revisit each category of Pandas functions and methods:

### 1. **Creating Data Structures**

#### Creating a Series:
```python
import pandas as pd

# Creating a Series
data = [1, 2, 3, 4, 5]
index = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(data, index=index)

print(s)
```

#### Creating a DataFrame:
```python
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

print(df)
```

### 2. **Reading and Writing Data**

#### Reading from CSV:
```python
# Reading from CSV
df_csv = pd.read_csv('data.csv')

print(df_csv.head())
```

#### Writing to CSV:
```python
# Writing to CSV
df.to_csv('output.csv', index=False)
```

### 3. **Viewing and Inspecting Data**

#### Viewing first few rows:
```python
# Viewing first few rows
print(df.head())
```

#### Getting DataFrame info:
```python
# Getting DataFrame info (info() prints the summary itself and returns None,
# so it should not be wrapped in print())
df.info()
```

### 4. **Selection and Filtering**

#### Selecting using loc:
```python
# Selecting using loc
print(df.loc[df['Age'] > 30])
```

### 5. **Data Manipulation**

#### Assigning a new column:
```python
# Assigning a new column
df['Senior'] = df['Age'] > 30

print(df)
```

#### Sorting values:
```python
# Sorting values
df_sorted = df.sort_values(by='Age', ascending=False)

print(df_sorted)
```

### 6. **Handling Missing Data**

#### Checking for null values:
```python
# Checking for null values
print(df.isnull().sum())
```

#### Dropping rows with missing values:
```python
# Dropping rows with missing values
df_clean = df.dropna()

print(df_clean)
```

### 7. **Aggregating Data**

#### Calculating sum and mean:
```python
# Calculating sum and mean
print("Sum of Ages:", df['Age'].sum())
print("Mean Age:", df['Age'].mean())
```
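
#### Aggregating along an axis:

The aggregation methods take an `axis` argument. The sketch below, on a made-up numeric DataFrame, contrasts column-wise and row-wise sums and computes several statistics at once with `agg`.

```python
import pandas as pd

df_num = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0 (default): one result per column
col_totals = df_num.sum(axis=0)
print(col_totals.tolist())  # [6, 60]

# axis=1: one result per row
row_totals = df_num.sum(axis=1)
print(row_totals.tolist())  # [11, 22, 33]

# Several statistics at once; the result is indexed by function name
stats = df_num.agg(['min', 'max', 'median', 'std'])
print(stats)
```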

### 8. **Time Series**

#### Creating a DatetimeIndex:
```python
# Creating a DatetimeIndex
dates = pd.date_range('2023-01-01', periods=5)
ts = pd.Series(range(5), index=dates)

print(ts)
```
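
#### Converting strings to datetimes:

`pd.to_datetime` from the overview has no example yet, so here is a minimal sketch. The date strings are made up; the `.dt` accessor exposes datetime components of the converted Series.

```python
import pandas as pd

# Parse ISO-formatted date strings into datetime values
raw = pd.Series(['2023-01-01', '2023-01-15', '2023-02-01'])
parsed = pd.to_datetime(raw)

# The .dt accessor exposes datetime components
print(parsed.dt.month.tolist())       # [1, 1, 2]
print(parsed.dt.day_name().tolist())  # ['Sunday', 'Sunday', 'Wednesday']

# date_range with an explicit frequency: every 3 days between two dates
every_third_day = pd.date_range('2023-01-01', '2023-01-10', freq='3D')
print(len(every_third_day))  # 4
```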

### 9. **Pivoting and Reshaping Data**

#### Creating a pivot table:
```python
# Creating a pivot table
pivot_table = df.pivot_table(index='Name', columns='City', values='Age', aggfunc='mean')

print(pivot_table)
```
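
#### Melting, pivoting, and stacking:

The other reshaping methods from the overview can be sketched on a small made-up table of scores. The column names (`subject`, `score`) are chosen for illustration.

```python
import pandas as pd

wide = pd.DataFrame({'name': ['Alice', 'Bob'], 'math': [90, 80], 'art': [70, 60]})

# melt: wide -> long, one row per (name, subject) pair
long = wide.melt(id_vars='name', value_vars=['math', 'art'],
                 var_name='subject', value_name='score')
print(long)

# pivot: long -> wide again (no aggregation, so pairs must be unique)
back = long.pivot(index='name', columns='subject', values='score')
print(back)

# stack: move the column labels into an inner index level (gives a Series here)
stacked = wide.set_index('name').stack()
print(stacked)

# unstack: move the inner index level back into columns
unstacked = stacked.unstack()
print(unstacked)
```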

### 10. **Miscellaneous**

#### Concatenating DataFrames:
```python
# Concatenating DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

df_concat = pd.concat([df1, df2])

print(df_concat)
```
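
In the result above, the row labels `0, 1, 2` appear twice because each input keeps its own index. As a sketch with made-up frames, `ignore_index=True` renumbers the rows, and `axis=1` places frames side by side instead of stacking them.

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
bottom = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
extra = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']})

# Stack rows and renumber the index 0..n-1
rows = pd.concat([top, bottom], ignore_index=True)
print(rows.index.tolist())  # [0, 1, 2, 3]

# Place frames side by side, aligned on the row index
side_by_side = pd.concat([top, extra], axis=1)
print(side_by_side.columns.tolist())  # ['A', 'B', 'C', 'D']
```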

#### Cross-tabulation:
```python
# Cross-tabulation
cross_tab = pd.crosstab(df['City'], df['Senior'])

print(cross_tab)
```

These examples cover a wide range of Pandas functionalities, demonstrating how to create data structures, read/write data, view/inspect data, select/filter data, manipulate data, handle missing data, aggregate data, work with time series, reshape data, and use miscellaneous functions like concatenation and cross-tabulation. Each example illustrates typical usage scenarios to help you get started with Pandas effectively.