# Pandas Basics

## What is Pandas?

Pandas is a package for the Python programming language, used for data manipulation and analysis.

In particular, it offers data structures and operations for manipulating numerical tables and time series.

## Installing Pandas (PIP)

Pip is a package manager for Python that automates the process of installing, upgrading, configuring, and removing packages.

`How it works`
Pip uses PyPI (the Python Package Index) as its default repository for fetching packages, but it can also install packages from other sources, such as version control systems, requirements files, and distribution files.

`When it's included`
Pip is included by default in Python version 3.4 or later. If you installed Python from source, with an installer from python.org, or via Homebrew, you should already have pip.

`How to use it`
You run pip from your system's command-line interface, not from within Python itself. To install a package, type `pip install` followed by the name of the package.

In VS Code, run this in the integrated terminal (with your venv activated, if you are using one).

Syntax: `pip install pandas`


## Introduction to Pandas
- Pandas is a Python library used for working with data sets.
- It provides functions for analyzing, cleaning, exploring, and manipulating data.
- Created by Wes McKinney in 2008.
- The name Pandas is derived from **Panel Data** and **Python Data Analysis**.

# Why Use Pandas in Python?

## 1. Data Structures for Efficient Handling
- **Series**: One-dimensional labeled array (similar to a column in Excel or a list with labels).
- **DataFrame**: Two-dimensional labeled data structure (like a spreadsheet or SQL table).

## 2. Easy Data Manipulation
- Read and write data from multiple sources: **CSV, Excel, SQL databases, JSON**, and more.
- Perform operations like **filtering, sorting, reshaping, and aggregation** efficiently.
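
As a minimal sketch of filtering and sorting, the example below uses a small made-up table. The `cities` DataFrame and its column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration.
cities = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo", "Pune"],
    "temp_c": [4, 19, 28, 31],
})

# Filtering: keep only the rows where the condition is True.
warm = cities[cities["temp_c"] > 15]

# Sorting: order the remaining rows from hottest to coolest.
ranked = warm.sort_values("temp_c", ascending=False)

print(ranked)
```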

## 3. Powerful Data Cleaning Tools
- Handle missing values easily using `.fillna()`, `.dropna()`, and `.interpolate()`.
- Detect and remove duplicate entries with `.duplicated()` and `.drop_duplicates()`.
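
The cleaning methods listed above could be used roughly as follows. The `scores` and `readings` data are invented for this sketch, and filling missing values with `0.0` is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing score and one repeated row.
scores = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy"],
    "score": [90.0, 85.0, 85.0, np.nan],
})

filled = scores.fillna({"score": 0.0})   # replace missing scores with 0.0
dropped = scores.dropna()                # drop rows that contain a missing value
dup_mask = scores.duplicated()           # True where a row repeats an earlier row
deduped = scores.drop_duplicates()       # keep the first copy of each row

# interpolate() fills gaps from neighbouring values (linear by default).
readings = pd.Series([1.0, np.nan, 3.0])
interpolated = readings.interpolate()

print(dup_mask.tolist())
print(interpolated.tolist())
```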

## 4. Fast and Efficient Data Processing
- Built on **NumPy**, enabling fast operations on large datasets.
- **Vectorized operations** avoid the need for slow Python loops.
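
To illustrate what "vectorized" means here, the sketch below computes the same result twice: once with an explicit Python loop and once with a single operation on the whole Series. The `prices` values and `tax_rate` are hypothetical.

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0, 40.0])
tax_rate = 0.25  # hypothetical rate, chosen only for the example

# Loop version: Python handles one element at a time.
looped = pd.Series([p * (1 + tax_rate) for p in prices])

# Vectorized version: one expression applied to every element at once.
vectorized = prices * (1 + tax_rate)

print(vectorized.tolist())
```

Both versions give the same numbers. On large datasets the vectorized form is usually the faster one.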

## 5. Advanced Data Aggregation & Grouping
- Use `.groupby()` to aggregate data and perform computations like **mean, sum, count**, etc.
- Use `.pivot_table()` for **multi-dimensional data summarization**.
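
A small sketch of both methods, assuming a made-up `sales` table with `region`, `product`, and `units` columns:

```python
import pandas as pd

# Hypothetical sales data for illustration.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["a", "b", "a", "b"],
    "units": [10, 20, 30, 40],
})

# groupby: one row per region, with several aggregates of the units column.
per_region = sales.groupby("region")["units"].agg(["mean", "sum", "count"])

# pivot_table: regions as rows, products as columns, summed units as cells.
table = sales.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)

print(per_region)
print(table)
```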

## 6. Merging and Joining Datasets
- Use `.merge()` and `.join()` to **combine multiple datasets** based on keys (like SQL joins).
- Use `pd.concat()` to concatenate datasets **vertically or horizontally**.
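
The sketch below combines two made-up tables. The `customers`, `orders`, and `cities` frames are assumptions for illustration, and `how="left"` is only one of the available join types.

```python
import pandas as pd

# Hypothetical tables sharing a customer_id key.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [25, 40, 10]})
cities = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"]})

# Left join: keep every customer, even those without orders.
joined = customers.merge(orders, on="customer_id", how="left")

# Vertical concatenation: stack rows and renumber the index.
stacked = pd.concat([customers, customers], ignore_index=True)

# Horizontal concatenation: place columns side by side, aligned on the index.
side_by_side = pd.concat([customers, cities], axis=1)

print(joined)
```

Note that `merge()` is a DataFrame method, while `concat()` is a function called on the `pd` module itself.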

## 7. Time Series Analysis
- **Resample, shift, and handle time-indexed data** easily with `.resample()` and `.shift()`.
- Perform **rolling window calculations** using `.rolling()`.
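
As a rough sketch of these three methods, assume a short daily series whose dates and values are invented for the example:

```python
import pandas as pd

# Hypothetical daily series: six consecutive days.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day = ts.resample("2D").sum()            # totals per two-day bin
lagged = ts.shift(1)                         # each value moved one day later
rolling_mean = ts.rolling(window=3).mean()   # mean over the last three days

print(two_day.tolist())
print(rolling_mean.tolist())
```

The first entries of `lagged` and `rolling_mean` are missing (`NaN`) because there is not yet enough earlier data to fill them.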

## 8. Built-in Visualization
- Integrates with **Matplotlib** to create plots directly from DataFrames using `.plot()`.
- Supports **line plots, bar charts, histograms, scatter plots**, and more.

## 9. Compatibility with Other Libraries
- Works seamlessly with **NumPy, Matplotlib, Seaborn, Scikit-Learn, and TensorFlow**.
- Can convert data to **NumPy arrays** for machine learning models.
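
Converting a DataFrame to a NumPy array might look like this. The `features` table is hypothetical, and the variable name `X` simply follows a common machine learning convention.

```python
import pandas as pd

# Hypothetical feature table.
features = pd.DataFrame({
    "height_m": [1.6, 1.7, 1.8],
    "weight_kg": [60.0, 70.0, 80.0],
})

# to_numpy() returns the values as a plain NumPy array (no labels).
X = features.to_numpy()

print(type(X).__name__, X.shape)
```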

## 10. Data Export and Storage
- Easily save processed data back to **CSV, Excel, JSON, SQL**, and other formats.
- Helps in **data pipeline automation**.
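
A minimal export sketch follows. It uses a made-up two-row DataFrame and returns the output as strings, so nothing is written to disk. Passing a file path as the first argument (for example a hypothetical `'out.csv'`) would write a file instead.

```python
import json

import pandas as pd

# Hypothetical processed data.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# With no path given, to_csv() and to_json() return the text.
csv_text = df.to_csv(index=False)          # index=False omits the row labels
json_text = df.to_json(orient="records")   # one JSON object per row

print(csv_text)
print(json.loads(json_text))
```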


### Installation of Pandas
If Pandas is not installed, you can install it using pip. In a Jupyter notebook cell, the leading `!` runs the command in the system shell:
```
!pip install pandas
```

In [None]:
import pandas as pd
print('Pandas is successfully imported.')

### Pandas Series
A **Series** is a one-dimensional array holding data of any type. It is similar to a column in a table.

Use a Series when your data is one-dimensional. A single-column DataFrame would also work, but it uses more memory.

In [None]:
a = [1, 7, 2]  # What type of variable is a?
myvar = pd.Series(a)  # Create a Series. We write pd because that is the alias we gave pandas on import.
print(myvar)

By default, Pandas labels the Series with index numbers starting from 0. You can also create custom labels using the `index` argument.

In [None]:
a = [1, 7, 2]
myvar = pd.Series(a, index=['x', 'y', 'z'])  # The index must have the same length as the data.
print(myvar)
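
Once a Series has custom labels, you can use them to look up values. The sketch below also shows what happens when the index length does not match the data. The variable names are chosen only for illustration.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar["y"]         # look up a value by its label
by_position = myvar.iloc[0]   # look up a value by its position

print(by_label, by_position)

# An index of the wrong length is rejected with a ValueError.
try:
    pd.Series([1, 7, 2], index=["x", "y"])
    length_error = False
except ValueError:
    length_error = True

print(length_error)
```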

### Pandas DataFrame
A **DataFrame** is a 2-dimensional data structure, like a table with rows and columns.

In [None]:
data = {'calories': [420, 380, 390], 'duration': [50, 40, 45]} #what type is this?
df = pd.DataFrame(data)
print(df)

In [None]:
# In a Jupyter notebook, you can pretty-print the DataFrame by putting just its name on the last line of a cell.
df
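
As a short sketch of how rows and columns relate, the example below rebuilds the same DataFrame and pulls out one column and one row. Each of them comes back as a Series.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

calories = df["calories"]   # one column, returned as a Series
first_row = df.loc[0]       # the row labeled 0, also returned as a Series

print(df.shape)             # (number of rows, number of columns)
print(first_row["duration"])
```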

### Read CSV Files
You can load CSV files directly into Pandas DataFrames using `pd.read_csv()`.

In [None]:
# Example to load CSV file
df = pd.read_csv('data.csv') #note how easy the syntax is in python
print(df.head(2))

In [None]:
df

Side Note: Here is the pandas source code for this function. Look how easy they made it for you!
    
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/readers.py
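
If you do not have a `data.csv` file at hand, you can try `pd.read_csv()` on in-memory text instead. The sketch below wraps a small CSV string in `io.StringIO`. The column names and values are made up and are not the contents of any real `data.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,117,479.0
30,103,300.5
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))
```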

### Read JSON Files
You can also load JSON files into Pandas DataFrames using `pd.read_json()`.

In [None]:
# Example to load JSON file
df = pd.read_json('data.json')
print(df.head())

In [None]:
df

Side Note: Here is the pandas source code for this function. Look how easy they made it for you!
    
https://github.com/pandas-dev/pandas/blob/main/pandas/io/json/_json.py
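
The same approach works for JSON when no `data.json` file is available. This sketch assumes a small, made-up list of records wrapped in `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical JSON content: a list of records, one object per row.
json_text = '[{"Duration": 60, "Pulse": 110}, {"Duration": 45, "Pulse": 117}]'

df = pd.read_json(io.StringIO(json_text))

print(df.head())
```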

### Analyzing DataFrames
You can get a quick overview of a DataFrame using the `head()` and `tail()` methods to view the first and last rows.

In [None]:
print(df.head(10))  # head() returns 5 rows by default; pass a number to choose how many rows you want.
print(df.tail())
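
To see exactly which rows `head()` and `tail()` return, here is a small sketch on a made-up DataFrame with 20 numbered rows:

```python
import pandas as pd

# Hypothetical data: a single column holding the numbers 1 to 20.
df = pd.DataFrame({"n": range(1, 21)})

first_five = df.head()     # default: the first 5 rows
first_ten = df.head(10)    # the first 10 rows
last_five = df.tail()      # default: the last 5 rows
last_three = df.tail(3)    # the last 3 rows

print(first_five["n"].tolist())
print(last_three["n"].tolist())
```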

The `info()` method gives you more information about the data, such as the number of entries, columns, and non-null values.

In [None]:
df.info()  # info() prints the summary itself and returns None, so no print() is needed.
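
As a brief sketch of what `info()` reports, the example below uses a made-up DataFrame with one missing value and captures the report through the `buf` argument. Capturing it this way is optional and is done here only so the text can be inspected.

```python
import io

import pandas as pd

# Hypothetical data with one missing calories value.
df = pd.DataFrame({"calories": [420, 380, None], "duration": [50, 40, 45]})

buffer = io.StringIO()
result = df.info(buf=buffer)   # write the report into the buffer
report = buffer.getvalue()

print(report)
print(result)   # info() itself returns None
```

The report lists the number of entries, each column's non-null count, and each column's data type, which makes missing values easy to spot.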