![introduction_to_pandas.png](attachment:366bf024-656b-4bcd-b138-1c07bd5e2987.png)

# Introduction to Pandas

#### by Joe Eberle started on 05-23-2023 - https://github.com/JoeEberle/ - josepheberle@outlook.com

## Introduction to Pandas

Pandas is a **powerful Python library** used for:
- **data manipulation**
- **data analysis**
- **ETL Processing and data integration**

Pandas offers data structures like DataFrame and Series to **efficiently handle structured data**. 

Pandas is important because it simplifies complex data operations, making it **easier to clean**, analyze, and visualize large datasets, which is essential for data science and machine learning tasks.


##  Here are some reasons people use Pandas for data science and analysis:

1. **Data Manipulation and Cleaning:** Pandas provides powerful tools for data manipulation, allowing users to efficiently clean, transform, and prepare data for analysis.
2. **Flexible Data Structures:** The DataFrame and Series data structures in Pandas are versatile and can handle a variety of data formats, making it easy to import, export, and work with different types of data.
3. **Ease of Use:** Pandas offers a user-friendly syntax and functions that simplify complex data operations, making it accessible for both beginners and experienced data scientists.
4. **Integration with Other Libraries:** Pandas integrates seamlessly with other popular data science libraries such as NumPy, Matplotlib, and scikit-learn, enhancing its functionality and enabling comprehensive data analysis workflows.
5. **Performance and Efficiency:** Pandas is optimized for performance, allowing users to process large datasets quickly and efficiently, which is crucial for data-intensive tasks in data science.


## Step 1 - Installing Pandas

In [1]:
first_installation = False
if first_installation:
    !pip install pandas

## Step 2 - Importing Libraries Including Pandas

In [3]:
from datetime import datetime
import pandas as pd 
import quick_logger as ql
import talking_code as tc 
import file_manager as fm 
import time
print(f"Libraries Imported successfully on {datetime.now().date()} at {datetime.now().time()}") 

Libraries Imported successfully on 2024-06-14 at 08:29:56.516298


#### Required Setup Step 0 - Initiate Configuration Settings and name the overall solution

In [3]:
import configparser 
config = configparser.ConfigParser()
cfg = config.read('config.ini')  
solution_name = 'introduction_to_pandas'

#### Required Setup Step 0 - Initiate Logging and debugging 

In [4]:
# Establish the Python Logger  
import logging # built in python library that does not need to be installed 
import quick_logger as ql

# Record the start time so the total process duration can be reported at the end
start_time = ql.set_start_time()
logging = ql.create_logger_start(solution_name, start_time) 
ql.set_speaking_log(False)
ql.set_speaking_steps(False)
ql.pvlog('info',f'Process {solution_name} Step 0 - Initializing and starting Logging Process.') 

Process introduction_to_pandas Step 0 - Initializing and starting Logging Process.


## Understand the primary data structures: **DataFrame** 

### A pandas **DataFrame** is a two-dimensional table of data in Python, similar to an Excel spreadsheet with labeled rows and columns.

In [17]:
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df_example = pd.DataFrame(data)

In [18]:
# display the contents of the dataframe
df_example

|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |


In [19]:
# display the first 2 rows of the dataframe
df_example.head(2)

|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |


In [20]:
# display the last 10 rows (the dataframe has only 3 rows, so all of them are shown)
df_example.tail(10)

|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
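
A `Series`, the other primary data structure, is a single labeled column. The block below is a minimal sketch that rebuilds the same sample data so it runs on its own; the names `ages` and `s` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Rebuild the small example frame so this block is self-contained
df_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Selecting one column returns a Series that shares the DataFrame's index
ages = df_example['Age']
print(type(ages).__name__)   # Series
print(ages.mean())           # 30.0

# A Series can also be built directly, with custom index labels
s = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(s['Bob'])              # 30
```
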


## Create a data playground 

In [15]:
import os
# Use a raw string so the backslashes in the Windows path are not read as escape sequences
directory_path = r'C:\Data\sample_datasets'
data_play_ground = directory_path
# Create the directory if it doesn't exist
if not os.path.exists(directory_path):
    os.makedirs(directory_path)
    print(f"Directory '{directory_path}' created successfully.")
else:
    print(f"Directory '{directory_path}' already exists.")

Directory 'C:\Data\sample_datasets' already exists.


In [23]:
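# Build the path with os.path.join instead of concatenating separators by hand
file_name = os.path.join(data_play_ground, "df_example.xlsx")
df_example.to_excel(file_name)  # writing .xlsx files requires the openpyxl package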

## Read in a CSV using Pandas

In [11]:
df = pd.read_csv("https://raw.githubusercontent.com/JoeEberle/reference_datasets/main/titanic.csv")    # Read the CSV file into a pandas DataFrame
df.head(5)

|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2.0 |  | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | 11.0 |  | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S |  | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON |


**Reading and Writing Data:**

- Use `pd.read_csv()` and `df.to_csv()` for reading from and writing to CSV files.
- Other formats include Excel (`pd.read_excel()`), JSON (`pd.read_json()`), and SQL databases (`pd.read_sql()`).
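
As a minimal sketch of a CSV round trip, the block below writes a small DataFrame to an in-memory buffer and reads it back. The buffer stands in for a file on disk so the example has no path dependencies; the sample data and the name `df_roundtrip` are assumptions made for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write to an in-memory text buffer instead of a file on disk;
# index=False keeps the row index out of the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back into a new DataFrame
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)
print(df_roundtrip)
```

Passing a file path instead of the buffer works the same way for both calls.
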

**Data Inspection:**

- Inspect data using methods like `df.head()`, `df.tail()`, `df.info()`, and `df.describe()` to get a quick overview of the DataFrame.
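
A short sketch of the inspection methods not demonstrated elsewhere in this notebook, using an assumed three-row sample frame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

df.info()                 # prints column names, non-null counts, and dtypes
summary = df.describe()   # summary statistics for the numeric columns only
print(summary.loc['mean', 'Age'])   # 30.0
print(df.shape)                     # (3, 2)
```
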

**Indexing and Slicing:**

- Access data using `.loc` for label-based indexing and `.iloc` for position-based indexing.
- Example: `df.loc[0, 'col1']` or `df.iloc[0, 0]`.
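
With the default integer index, `.loc` and `.iloc` can look interchangeable. The sketch below assumes a frame with string row labels (chosen here purely for illustration) so the difference is visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']},
    index=['alice', 'bob', 'charlie'],   # string labels make the difference visible
)

by_label = df.loc['bob', 'City']     # row label 'bob', column label 'City'
by_position = df.iloc[1, 1]          # second row, second column
print(by_label, by_position)         # Los Angeles Los Angeles

# .loc slices include the end label; .iloc slices exclude the end position
print(len(df.loc['alice':'bob']))    # 2
print(len(df.iloc[0:1]))             # 1
```
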

**Data Cleaning:**

- Handle missing data using methods like `df.dropna()`, `df.fillna()`, and `df.isna()`.
- Remove duplicates with `df.drop_duplicates()`.
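
The cleaning methods can be sketched on a small frame that contains one duplicate row and one missing value. The data and the choice to fill with the column mean are assumptions for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Dana'],
    'age': [25.0, 30.0, 30.0, np.nan],
})

print(df.isna().sum())                 # count of missing values per column
deduped = df.drop_duplicates()         # drops the repeated 'Bob' row
dropped = deduped.dropna()             # removes the row with the missing age
filled = deduped.fillna({'age': deduped['age'].mean()})  # or fill it with the mean

print(len(deduped), len(dropped))      # 3 2
print(filled['age'].tolist())          # [25.0, 30.0, 27.5]
```

Each method returns a new DataFrame, so the original `df` is left unchanged.
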

**Data Manipulation:**

- Use operations like `df['new_col'] = df['col1'] + df['col2']` for creating new columns.
- Apply functions across columns or rows with `df.apply()`, or element by element with `df.applymap()` (deprecated since pandas 2.1 in favor of `df.map()`).
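
A minimal sketch of column arithmetic and `apply`, assuming two small numeric columns named `col1` and `col2` as in the text; `row_max` and `col1_squared` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# Vectorized arithmetic builds a new column without an explicit loop
df['new_col'] = df['col1'] + df['col2']

# apply with axis=1 calls the function once per row
df['row_max'] = df.apply(lambda row: max(row['col1'], row['col2']), axis=1)

# apply on a single column (a Series) calls the function once per value
df['col1_squared'] = df['col1'].apply(lambda x: x ** 2)

print(df['new_col'].tolist())        # [11, 22, 33]
print(df['col1_squared'].tolist())   # [1, 4, 9]
```

Vectorized expressions are usually faster than `apply`, which is best kept for logic that cannot be written as column operations.
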

**Grouping and Aggregation:**

- Group data using `df.groupby('col1')` and perform aggregations like `sum()`, `mean()`, and `count()`.
- Example: `df.groupby('col1').sum()`.
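
The sketch below groups an assumed four-row frame by `col1` and aggregates a numeric column named `value` (a hypothetical name). Selecting the numeric column before aggregating avoids errors when the frame also holds text columns.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'b'],
    'value': [1, 2, 3, 4],
})

# One aggregation per group
totals = df.groupby('col1')['value'].sum()
print(totals.to_dict())   # {'a': 4, 'b': 6}

# agg computes several statistics per group in one call
stats = df.groupby('col1')['value'].agg(['sum', 'mean', 'count'])
print(stats.loc['a', 'mean'])   # 2.0
```
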

**Merging and Joining:**

- Combine DataFrames using `pd.merge()` for SQL-like joins (inner, outer, left, right).
- Concatenate DataFrames using `pd.concat()`.
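
To show how the join type changes the result, the sketch below merges two assumed frames in which one person's city has no match, then stacks an extra row with `pd.concat()`. The frame names and the extra rows are illustrative.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'],
                       'City': ['New York', 'Chicago', 'Boston']})
cities = pd.DataFrame({'City': ['New York', 'Chicago'],
                       'Population': [8000000, 2700000]})

inner = pd.merge(people, cities, on='City', how='inner')  # only matching cities
left = pd.merge(people, cities, on='City', how='left')    # keep every person
print(len(inner), len(left))                # 2 3
print(left['Population'].isna().sum())      # 1

# concat stacks frames vertically; ignore_index renumbers the rows
more_people = pd.DataFrame({'Name': ['Eve'], 'City': ['Chicago']})
stacked = pd.concat([people, more_people], ignore_index=True)
print(len(stacked))                         # 4
```
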

In [None]:
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Inspect the DataFrame
print(df.head())

# Select a column
print(df['Name'])

# Filter rows
print(df[df['Age'] > 25])

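# Group by and aggregate (select the numeric column first; averaging the
# text 'Name' column raises a TypeError in pandas 2.0 and later)
grouped = df.groupby('City')['Age'].mean()
print(grouped)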

# Merge with another DataFrame
data2 = {'City': ['New York', 'Los Angeles', 'Chicago'], 'Population': [8000000, 4000000, 2700000]}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df, df2, on='City')
print(merged_df)


In [1]:
definition = '''
## Introduction to Pandas

Pandas is a **powerful Python library** used for:
- **data manipulation**
- **data analysis**
- **ETL Processing and data integration**

Pandas offers data structures like DataFrame and Series to **efficiently handle structured data**. 

Pandas is important because it simplifies complex data operations, making it **easier to clean**, analyze, and visualize large datasets, which is essential for data science and machine learning tasks.


##  Here are some reasons people use Pandas for data science and analysis:

1. **Data Manipulation and Cleaning:** Pandas provides powerful tools for data manipulation, allowing users to efficiently clean, transform, and prepare data for analysis.
2. **Flexible Data Structures:** The DataFrame and Series data structures in Pandas are versatile and can handle a variety of data formats, making it easy to import, export, and work with different types of data.
3. **Ease of Use:** Pandas offers a user-friendly syntax and functions that simplify complex data operations, making it accessible for both beginners and experienced data scientists.
4. **Integration with Other Libraries:** Pandas integrates seamlessly with other popular data science libraries such as NumPy, Matplotlib, and scikit-learn, enhancing its functionality and enabling comprehensive data analysis workflows.
5. **Performance and Efficiency:** Pandas is optimized for performance, allowing users to process large datasets quickly and efficiently, which is crucial for data-intensive tasks in data science.

''' 
# Write the solution definition out to the solution_description.md file
file_name = "solution_description.md"
with open(file_name, 'w') as f:
    f.write(definition)

talking_code = False
if talking_code:
    tc.print_say(definition) 
else:
    print(definition)    



## Step 0 - Process End - display log

In [6]:
# Calculate and classify the process performance 
status = ql.calculate_process_performance(solution_name, start_time) 
print(ql.append_log_file(solution_name))  

2024-06-13 14:37:50,399 - INFO - START introduction_to_pandas Start Time = 2024-06-13 14:37:50
2024-06-13 14:37:50,399 - INFO - introduction_to_pandas Step 0 - Initialize the configuration file parser
2024-06-13 14:37:50,400 - INFO - Process introduction_to_pandas Step 0 - Initializing and starting Logging Process.
2024-06-13 14:37:50,430 - INFO - PERFORMANCE introduction_to_pandas The total process duration was:0.03
2024-06-13 14:37:50,430 - INFO - PERFORMANCE introduction_to_pandas Stop Time = 2024-06-13 14:37:50
2024-06-13 14:37:50,430 - INFO - PERFORMANCE introduction_to_pandas Short process duration less than 3 Seconds:0.03
2024-06-13 14:37:50,430 - INFO - PERFORMANCE introduction_to_pandas Performance optimization is not reccomended


