# 1. Getting Familiar with Pandas
Pandas is a powerful data manipulation library for Python. It provides two primary data structures: the DataFrame and the Series.

### DataFrames and Series
- **DataFrame**: A 2-dimensional labeled data structure with columns of potentially different types.
- **Series**: A 1-dimensional labeled array capable of holding any data type.
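
To make the relationship between the two structures concrete, here is a minimal sketch. The data and the row labels `'a'`, `'b'`, `'c'` are made up for illustration. It shows that each column of a DataFrame is itself a Series that shares the DataFrame's index.

```python
import pandas as pd

# Illustrative data; the row labels 'a', 'b', 'c' are arbitrary
frame = pd.DataFrame(
    {'Price': [20000, 22000, 25000], 'Rating': [4.5, 4.0, 4.2]},
    index=['a', 'b', 'c'],
)

# Selecting a single column returns a Series
price = frame['Price']
print(type(price).__name__)              # Series

# The Series keeps the row labels of the DataFrame
print(price.index.equals(frame.index))   # True

# Each column carries its own dtype
print(frame.dtypes)
```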

### Creating DataFrames and Series
You can create DataFrames and Series from various data sources like lists, dictionaries, and CSV files.

#### Creating a Series from a List

In [2]:
import pandas as pd

# Create a Series
data = [10, 20, 30, 40, 50]
series = pd.Series(data, name='Numbers')
print(series)

0    10
1    20
2    30
3    40
4    50
Name: Numbers, dtype: int64


#### Creating a DataFrame from a Dictionary

In [3]:

# Create a DataFrame
data = {
    'Car': ['Toyota', 'Honda', 'Ford'],
    'Price': [20000, 22000, 25000],
    'Rating': [4.5, 4.0, 4.2]
}
df = pd.DataFrame(data)
print(df)

      Car  Price  Rating
0  Toyota  20000     4.5
1   Honda  22000     4.0
2    Ford  25000     4.2


#### Creating a DataFrame from a CSV File

In [4]:
# Create a dictionary with sample data
data = {
    'Category': ['Car', 'Car', 'Film', 'Film', 'Game', 'Game', 'Fruit', 'Fruit'],
    'Name': ['Toyota', 'Honda', 'Inception', 'Avatar', 'Minecraft', 'Fortnite', 'Apple', 'Banana'],
    'Price': [20000, 22000, None, 150, 30, 20, 1, 0.5],
    'Rating': [4.5, 4.0, 8.8, 7.8, 9.0, 8.5, 4.2, 4.0],
    'Year': [2020, 2021, 2010, 2009, 2011, 2017, None, None]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

# Read data from a CSV file
df = pd.read_csv('data.csv')
print(df.head())

  Category       Name    Price  Rating    Year
0      Car     Toyota  20000.0     4.5  2020.0
1      Car      Honda  22000.0     4.0  2021.0
2     Film  Inception      NaN     8.8  2010.0
3     Film     Avatar    150.0     7.8  2009.0
4     Game  Minecraft     30.0     9.0  2011.0
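
Note that `Year` comes back as a float (`2020.0`). The column contains missing values, and a plain NumPy integer column cannot hold `NaN`. One possible workaround is to request pandas' nullable `Int64` dtype when reading. The sketch below uses a small in-memory CSV (an assumption for illustration) rather than `data.csv`.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a file on disk (illustrative values)
csv_text = """Name,Year
Toyota,2020
Inception,2010
Apple,
"""

# Default behaviour: the missing value forces the column to float64
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df['Year'].dtype)

# Nullable integer dtype keeps whole numbers and marks the gap as <NA>
nullable_df = pd.read_csv(io.StringIO(csv_text), dtype={'Year': 'Int64'})
print(nullable_df['Year'].dtype)
print(nullable_df)
```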



### Common Operations
1. **Selecting Data**: Use `.loc[]` and `.iloc[]` for label-based and integer-based indexing, respectively.
2. **Filtering Rows**: Use boolean indexing to filter rows based on conditions.
3. **Modifying Data**: Modify data by assigning new values or using functions.

#### Selecting Data

In [9]:
# Select a column
print(df['Category'])

# Select rows and columns using .loc
print(df.loc[0])
print(df.loc[:, ['Category', 'Price']])

0      Car
1      Car
2     Film
3     Film
4     Game
5     Game
6    Fruit
7    Fruit
Name: Category, dtype: object
Category        Car
Name         Toyota
Price       20000.0
Rating          4.5
Year         2020.0
Name: 0, dtype: object
  Category    Price
0      Car  20000.0
1      Car  22000.0
2     Film      NaN
3     Film    150.0
4     Game     30.0
5     Game     20.0
6    Fruit      1.0
7    Fruit      0.5
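
The list of common operations also mentions `.iloc[]`, which selects by integer position rather than by label. Here is a minimal sketch on a small stand-in DataFrame. The string index labels are chosen only to make the difference from `.loc[]` visible.

```python
import pandas as pd

# Stand-in data with string row labels (illustrative)
items = pd.DataFrame(
    {'Category': ['Car', 'Film', 'Game'], 'Price': [20000.0, 150.0, 30.0]},
    index=['x', 'y', 'z'],
)

# First row by position (the same row as items.loc['x'])
print(items.iloc[0])

# First two rows and the first column by position; the slice end is excluded
print(items.iloc[0:2, [0]])

# Label-based slices include both endpoints
print(items.loc['x':'y', ['Category']])
```

The two slices return the same rows: `0:2` stops before position 2, while `'x':'y'` includes the label `'y'`.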


#### Filtering Rows

In [8]:
# Filter rows where Price is greater than 21000
filtered_df = df[df['Price'] > 21000]
print(filtered_df)

  Category   Name    Price  Rating    Year
1      Car  Honda  22000.0     4.0  2021.0
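
Boolean conditions can also be combined. The following sketch uses a small stand-in DataFrame with illustrative values. It shows `&` for "and" and `.isin()` for matching against a list of values.

```python
import pandas as pd

# Stand-in data (illustrative values)
items = pd.DataFrame({
    'Category': ['Car', 'Car', 'Film', 'Game'],
    'Name': ['Toyota', 'Honda', 'Avatar', 'Minecraft'],
    'Price': [20000.0, 22000.0, 150.0, 30.0],
    'Rating': [4.5, 4.0, 7.8, 9.0],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
cheap_and_good = items[(items['Price'] < 1000) & (items['Rating'] > 8)]
print(cheap_and_good)

# Match against a list of values with .isin()
cars_or_games = items[items['Category'].isin(['Car', 'Game'])]
print(cars_or_games)
```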


#### Modifying Data

In [6]:
# Add a new column
df['Discounted Price'] = df['Price'] * 0.9
print(df)

  Category       Name    Price  Rating    Year  Discounted Price
0      Car     Toyota  20000.0     4.5  2020.0          18000.00
1      Car      Honda  22000.0     4.0  2021.0          19800.00
2     Film  Inception      NaN     8.8  2010.0               NaN
3     Film     Avatar    150.0     7.8  2009.0            135.00
4     Game  Minecraft     30.0     9.0  2011.0             27.00
5     Game   Fortnite     20.0     8.5  2017.0             18.00
6    Fruit      Apple      1.0     4.2     NaN              0.90
7    Fruit     Banana      0.5     4.0     NaN              0.45
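
Besides arithmetic on whole columns, data can be modified by assigning to selected rows or by applying a function. A minimal sketch follows. The `Price Band` column and the 1000 threshold are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Stand-in data (illustrative values)
items = pd.DataFrame({
    'Name': ['Toyota', 'Avatar', 'Apple'],
    'Price': [20000.0, 150.0, 1.0],
})

# Assign a new value to the rows that match a condition
items.loc[items['Name'] == 'Apple', 'Price'] = 1.5

# Derive a column by applying a function to each value
items['Price Band'] = items['Price'].apply(
    lambda p: 'high' if p >= 1000 else 'low'
)
print(items)
```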


# 2. Data Handling with Pandas
Pandas provides powerful tools for handling and preprocessing data.

### Reading Data from Files
Use `pd.read_csv()` to read data from CSV files.

### Handling Missing Data
- **Fill Missing Values**: Use `.fillna()` to replace missing values with a specified value (such as a column mean), or `.ffill()` / `.bfill()` to propagate the previous or next valid value.
- **Drop Missing Values**: Use `.dropna()` to remove rows or columns with missing values.

### Transforming Data
- **Convert Data Types**: Use `.astype()` to convert data types.
- **Remove Duplicates**: Use `.drop_duplicates()` to remove duplicate rows.

#### Example: Reading Data from a CSV File


In [11]:
import numpy as np
# Create a dictionary with sample data including missing values
data = {
    'Category': ['Car', 'Car', 'Film', 'Film', 'Game', 'Game', 'Fruit', 'Fruit', 'Car', 'Game'],
    'Name': ['Toyota', 'Honda', 'Inception', 'Avatar', 'Minecraft', 'Fortnite', 'Apple', 'Banana', 'BMW', 'The Sims'],
    'Price': [20000, np.nan, np.nan, 150, 30, 20, np.nan, 0.5, 25000, np.nan],
    'Rating': [4.5, 4.0, 8.8, np.nan, 9.0, 8.5, 4.2, np.nan, 4.3, 7.5],
    'Year': [2020, 2021, 2010, 2009, np.nan, 2017, np.nan, np.nan, 2022, np.nan]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

# Read data from a CSV file
df = pd.read_csv('data.csv')
print(df.head())

  Category       Name    Price  Rating    Year
0      Car     Toyota  20000.0     4.5  2020.0
1      Car      Honda      NaN     4.0  2021.0
2     Film  Inception      NaN     8.8  2010.0
3     Film     Avatar    150.0     NaN  2009.0
4     Game  Minecraft     30.0     9.0     NaN


#### Example: Handling Missing Data

In [32]:
# Fill missing values
df['Price'] = df['Price'].fillna(df['Price'].mean())

# Drop rows with missing values
df.dropna(inplace=True)
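
The cell above prints nothing, so the effect of each step is easy to miss. The sketch below uses a small stand-in DataFrame with illustrative values. It counts missing values per column first, then fills one column and drops rows based on another.

```python
import numpy as np
import pandas as pd

# Stand-in data with gaps in two columns (illustrative values)
items = pd.DataFrame({
    'Name': ['Toyota', 'Honda', 'Avatar', 'Apple'],
    'Price': [20000.0, np.nan, 150.0, np.nan],
    'Rating': [4.5, 4.0, np.nan, 4.2],
})

# Count missing values per column before cleaning
print(items.isna().sum())

# Fill Price with the column mean (NaN is skipped when computing the mean)
items['Price'] = items['Price'].fillna(items['Price'].mean())

# Drop only the rows where Rating is missing, returning a new DataFrame
cleaned = items.dropna(subset=['Rating'])
print(cleaned)
```

Passing `subset` limits the check to the named columns, so rows are not dropped for gaps elsewhere.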

#### Example: Transforming Data

In [13]:
# Convert data type
df['Price'] = df['Price'].astype(float)

# Remove duplicates
df.drop_duplicates(inplace=True)
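
In the cell above `Price` is already a float and there are no duplicate rows, so neither call changes anything visible. A sketch where both steps have an effect, using made-up data with prices stored as text and one repeated row:

```python
import pandas as pd

# Stand-in data: prices stored as text, first row repeated (illustrative)
items = pd.DataFrame({
    'Name': ['Toyota', 'Toyota', 'Honda'],
    'Price': ['20000', '20000', '22000'],
})

# Convert the text column to integers
items['Price'] = items['Price'].astype(int)
print(items.dtypes)

# Drop exact duplicate rows; the first occurrence is kept by default
deduped = items.drop_duplicates()
print(deduped)
```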

In [31]:
# Build a small DataFrame to walk through the cleaning steps end to end
df = pd.DataFrame({
    'Car': ['Toyota', 'Honda', 'Ford'],
    'Price': [20000, 22000, 25000],
    'Rating': [4.5, 4.0, 4.2]
})
print(df)

# Handling missing data
df['Price'] = df['Price'].astype(float)  # Cast first: an integer column cannot hold NaN
df.loc[1, 'Price'] = None  # Introduce a missing value
df['Price'] = df['Price'].fillna(df['Price'].mean())
df.dropna(inplace=True)  # Drop rows with missing values (if any)

# Transforming data
df['Price'] = df['Price'].astype(float)
df.drop_duplicates(inplace=True)

print(df)

# Add a new column for demonstration
df['Discounted Price'] = df['Price'] * 0.9
print(df)


      Car  Price  Rating
0  Toyota  20000     4.5
1   Honda  22000     4.0
2    Ford  25000     4.2
      Car    Price  Rating
0  Toyota  20000.0     4.5
1   Honda  22500.0     4.0
2    Ford  25000.0     4.2
      Car    Price  Rating  Discounted Price
0  Toyota  20000.0     4.5           18000.0
1   Honda  22500.0     4.0           20250.0
2    Ford  25000.0     4.2           22500.0


# 3. Data Analysis with Pandas


Pandas is excellent for performing data analysis tasks.

### Generating Summary Statistics
You can use functions like `.describe()`, `.mean()`, `.median()`, and `.std()` to generate summary statistics.

### Grouping Data
Use `.groupby()` to group data and apply aggregate functions to each group.

### Advanced Data Manipulation
- **Merging DataFrames**: Use `.merge()` to combine DataFrames based on common columns.
- **Joining DataFrames**: Use `.join()` to join DataFrames on indices.
- **Concatenating DataFrames**: Use `pd.concat()` to concatenate DataFrames along a particular axis.

#### Example: Generating Summary Statistics

In [15]:
# Summary statistics
print(df.describe())
print(df['Price'].mean())
print(df['Rating'].median())
print(df['Price'].std())


         Price    Rating  Discounted Price
count      3.0  3.000000               3.0
mean   22500.0  4.233333           20250.0
std     2500.0  0.251661            2250.0
min    20000.0  4.000000           18000.0
25%    21250.0  4.100000           19125.0
50%    22500.0  4.200000           20250.0
75%    23750.0  4.350000           21375.0
max    25000.0  4.500000           22500.0
22500.0
4.2
2500.0


#### Example: Grouping Data

In [16]:
# Group by 'Car' and calculate mean price
grouped_df = df.groupby('Car')['Price'].mean()
print(grouped_df)

Car
Ford      25000.0
Honda     22500.0
Toyota    20000.0
Name: Price, dtype: float64
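
In the example above every car appears only once, so each group mean is just that car's price. Grouping becomes more informative when groups contain several rows. The sketch below uses a stand-in DataFrame with a `Category` column and illustrative values. It computes several aggregates per group with `.agg()`.

```python
import pandas as pd

# Stand-in data with repeated categories (illustrative values)
items = pd.DataFrame({
    'Category': ['Car', 'Car', 'Film', 'Film', 'Game'],
    'Price': [20000.0, 22000.0, 150.0, 250.0, 30.0],
    'Rating': [4.5, 4.0, 8.8, 7.8, 9.0],
})

# Several aggregates of one column per group
price_stats = items.groupby('Category')['Price'].agg(['mean', 'max', 'count'])
print(price_stats)

# Named aggregation: a different function for each column
summary = items.groupby('Category').agg(
    avg_price=('Price', 'mean'),
    best_rating=('Rating', 'max'),
)
print(summary)
```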


#### Example: Merging DataFrames

In [17]:
# Create another DataFrame
df2 = pd.DataFrame({
    'Car': ['Toyota', 'Honda'],
    'Year': [2020, 2021]
})

# Merge DataFrames
merged_df = pd.merge(df, df2, on='Car')
print(merged_df)

      Car    Price  Rating  Discounted Price  Year
0  Toyota  20000.0     4.5           18000.0  2020
1   Honda  22500.0     4.0           20250.0  2021
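
Ford is missing from the merged result because `pd.merge()` performs an inner join by default, keeping only keys found in both DataFrames. Passing `how='left'` keeps every row from the left DataFrame. The sketch below uses stand-in data that mirrors the example above.

```python
import pandas as pd

# Stand-in data mirroring the merge example
cars = pd.DataFrame({
    'Car': ['Toyota', 'Honda', 'Ford'],
    'Price': [20000.0, 22500.0, 25000.0],
})
years = pd.DataFrame({
    'Car': ['Toyota', 'Honda'],
    'Year': [2020, 2021],
})

# Left join: every row of `cars` is kept; Ford gets NaN for Year
left_merged = pd.merge(cars, years, on='Car', how='left')
print(left_merged)
```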


#### Example: Concatenating DataFrames

In [18]:
# Create another DataFrame
df3 = pd.DataFrame({
    'Car': ['BMW'],
    'Price': [30000],
    'Rating': [4.3]
})

# Concatenate DataFrames
concat_df = pd.concat([df, df3], ignore_index=True)
print(concat_df)

      Car    Price  Rating  Discounted Price
0  Toyota  20000.0     4.5           18000.0
1   Honda  22500.0     4.0           20250.0
2    Ford  25000.0     4.2           22500.0
3     BMW  30000.0     4.3               NaN
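
Joining on indices with `.join()` is listed above but has no example of its own. Here is a minimal sketch, assuming two small DataFrames indexed by car name (illustrative data):

```python
import pandas as pd

# Two stand-in DataFrames that share index labels
prices = pd.DataFrame(
    {'Price': [20000.0, 22500.0, 25000.0]},
    index=['Toyota', 'Honda', 'Ford'],
)
ratings = pd.DataFrame(
    {'Rating': [4.5, 4.0]},
    index=['Toyota', 'Honda'],
)

# .join() aligns rows on the index; it is a left join by default
joined = prices.join(ratings)
print(joined)
```

Ford has no entry in `ratings`, so its `Rating` is `NaN` in the result.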


# 4. Application in Data Science
Pandas is a cornerstone of data science due to its ability to handle, manipulate, and analyze data efficiently.

### Advantages Over Traditional Data Structures
- **Efficiency**: Column operations are vectorized on top of NumPy arrays, so they typically handle large datasets much faster than equivalent loops over native Python lists or dictionaries.
- **Flexibility**: Provides a wide range of functionalities for data manipulation and analysis.
- **Integration**: Works well with other libraries like NumPy, Matplotlib, and Scikit-learn.
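
As a rough illustration of the first and third points, the sketch below applies the same discount with a plain Python loop and with a single vectorized expression, then exposes the column as a NumPy array. The data is illustrative and no timing is measured here.

```python
import numpy as np
import pandas as pd

prices = pd.Series([20000.0, 22000.0, 25000.0], name='Price')

# Pure-Python approach: loop over the values one at a time
looped = [p * 0.9 for p in prices.tolist()]

# Vectorized approach: one expression applied to the whole column
vectorized = prices * 0.9

# Both approaches give the same numbers
print(np.allclose(looped, vectorized))

# The underlying data is available as a NumPy array for other libraries
print(type(prices.to_numpy()).__name__)
```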

### Real-World Examples
- **Data Cleaning**: Removing duplicates, handling missing values, and data transformation.
- **Exploratory Data Analysis (EDA)**: Analyzing datasets to summarize their main characteristics.
- **Machine Learning**: Preprocessing data for machine learning models, including feature extraction and data preparation.

#### Example: Data Cleaning and EDA


In [33]:
movies_data = {
    'Title': ['Inception', 'The Matrix', 'Avatar', 'The Dark Knight', 'Titanic', 'The Shawshank Redemption', 'Pulp Fiction', 'The Godfather', 'Forrest Gump', 'The Lord of the Rings: The Return of the King'],
    'Genre': ['Sci-Fi', 'Sci-Fi', 'Action', 'Action', 'Romance', 'Drama', 'Crime', 'Crime', 'Drama', 'Fantasy'],
    'Release_Year': [2000, 1998, 2005, 2008, np.nan, 1994, 1994, 1972, 1994, 2003],
    'Rating': [8.8, np.nan, 7.8, 9.0, 7.8, 9.3, 8.9, np.nan, 8.8, 8.9]
}

# Create a DataFrame
movies_df = pd.DataFrame(movies_data)

# Save the DataFrame to a CSV file
movies_df.to_csv('movies.csv', index=False)

# Load a dataset
df = pd.read_csv('movies.csv')

# Drop duplicates
df.drop_duplicates(inplace=True)

# Handle missing values
df["Rating"] = df["Rating"].fillna(df["Rating"].mean())

# Basic EDA
print(df.describe())
print(df['Rating'].value_counts())

       Release_Year     Rating
count      9.000000  10.000000
mean    1996.444444   8.662500
std       10.489413   0.489756
min     1972.000000   7.800000
25%     1994.000000   8.662500
50%     1998.000000   8.800000
75%     2003.000000   8.900000
max     2008.000000   9.300000
Rating
8.8000    2
8.6625    2
7.8000    2
8.9000    2
9.0000    1
9.3000    1
Name: count, dtype: int64


In [34]:
# Example of data cleaning and EDA
df = pd.DataFrame({
    'Car': ['Toyota', 'Honda', 'Ford', 'BMW'],
    'Price': [20000, 22000, 25000, 30000],
    'Rating': [4.5, 4.0, 4.2, 4.3],
    'Year': [2020, 2021, None, 2021]
})

# Drop duplicates
df.drop_duplicates(inplace=True)

# Handle missing values
df.fillna(df.mean(numeric_only=True), inplace=True)

# Basic EDA
print(df.describe())
print(df['Rating'].value_counts())


             Price    Rating         Year
count      4.00000  4.000000     4.000000
mean   24250.00000  4.250000  2020.666667
std     4349.32945  0.208167     0.471405
min    20000.00000  4.000000  2020.000000
25%    21500.00000  4.150000  2020.500000
50%    23500.00000  4.250000  2020.833333
75%    26250.00000  4.350000  2021.000000
max    30000.00000  4.500000  2021.000000
Rating
4.5    1
4.0    1
4.2    1
4.3    1
Name: count, dtype: int64
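
The machine learning use case listed under Real-World Examples can be sketched in the same style. The choice of `Rating` as the target, one-hot encoding for `Car`, and min-max scaling for `Price` are assumptions made for illustration, not steps prescribed by this document.

```python
import pandas as pd

# Stand-in data matching the cars example (illustrative values)
cars = pd.DataFrame({
    'Car': ['Toyota', 'Honda', 'Ford', 'BMW'],
    'Price': [20000, 22000, 25000, 30000],
    'Rating': [4.5, 4.0, 4.2, 4.3],
})

# Treat Rating as the (hypothetical) target and the other columns as features
target = cars['Rating']
features = cars.drop(columns=['Rating'])

# One-hot encode the text column so every feature is numeric
features = pd.get_dummies(features, columns=['Car'])

# Scale Price to the 0-1 range (min-max scaling)
price = features['Price']
features['Price'] = (price - price.min()) / (price.max() - price.min())
print(features)
```

The resulting `features` and `target` could then be handed to a library such as Scikit-learn.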


# Summary of Findings and Benefits of Pandas for Data Science Professionals

## 1. Data Handling with Pandas

- **DataFrames and Series**: Pandas provides two primary data structures, DataFrames and Series, which make handling and manipulating data straightforward. DataFrames are akin to SQL tables or Excel spreadsheets, while a Series is a one-dimensional labeled array, comparable to a single column.
- **Data Creation**: You can create DataFrames and Series from lists, dictionaries, and CSV files. This flexibility allows for efficient data import and export.
- **Data Manipulation**: Operations such as selecting specific rows or columns, filtering data, and modifying values are simple with Pandas, which is crucial for preprocessing and cleaning data.

## 2. Data Cleaning and Preprocessing

- **Handling Missing Values**: Pandas provides methods to identify and manage missing data, such as `fillna()` to fill missing values and `dropna()` to remove them.
- **Removing Duplicates**: Use `drop_duplicates()` to remove duplicate entries and ensure data integrity.
- **Data Type Conversion**: Easily convert data types with functions like `astype()` to ensure data is in the correct format for analysis.

## 3. Data Analysis

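- **Summary Statistics**: Generate summary statistics with `describe()`, which reports the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median), or compute individual values with `mean()`, `median()`, and `std()`.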
- **Grouping and Aggregation**: Group data and perform aggregate functions (e.g., `sum()`, `mean()`) to extract insights and analyze patterns.
- **Merging and Joining**: Combine datasets using `merge()` and `join()` based on common columns or indices, essential for integrating diverse data sources.

## 4. Application in Data Science

- **Efficiency**: Pandas operations are vectorized on top of NumPy arrays, so they typically handle large datasets much faster than loops over traditional Python data structures.
- **Integration with Other Libraries**: It integrates well with libraries like NumPy, Matplotlib, and Scikit-learn, enhancing its utility in data science workflows.
- **Data Cleaning and Exploratory Data Analysis (EDA)**: Pandas is crucial for data cleaning, preprocessing, and EDA, helping understand data distribution, detect anomalies, and prepare data for machine learning models.

## Benefits for Data Science Professionals

1. **Streamlined Data Processing**: Simplifies complex data manipulation tasks, allowing data scientists to focus on analysis rather than data wrangling.
2. **Enhanced Productivity**: Reduces the amount of code needed for data analysis, leading to faster development cycles.
3. **Better Data Insights**: Facilitates sophisticated data transformations and aggregations, uncovering valuable insights and trends.
4. **Real-World Applications**:
   - **Financial Analysis**: Analyze stock prices, generate financial reports, and forecast trends.
   - **Scientific Research**: Process and analyze experimental data, manage large datasets, and visualize results.
   - **Machine Learning**: Preprocess data, perform feature engineering, and prepare datasets for model training.

## Conclusion

Pandas is a powerful and versatile tool in data science. Its efficiency in handling, manipulating, and analyzing data makes it essential for data science professionals. By streamlining data processing tasks and providing comprehensive analysis capabilities, Pandas enhances productivity and supports more effective data-driven decision-making.
