# What is Pandas?

**Pandas** is an open-source Python library for data manipulation & analysis.

Pandas is designed to handle data in various formats, such as tabular data, time series data, and more, making it an essential part of the data processing workflow in many industries.

Here are some **key features and functionalities of Pandas**: 

* **Data Structures**: Pandas offers two primary data structures - **DataFrame** and **Series**.
    * A **DataFrame** is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
    * A **Series** is a one-dimensional labeled array, essentially a single column or row of data.

* **Data Import and Export**: 
    * Pandas makes it easy to read data from various sources, including CSV files, Excel spreadsheets, SQL databases, and more.
    * It can also export data to these formats, enabling seamless data exchange. 

* **Data Merging and Joining**: You can combine multiple DataFrames using methods like merge and join, similar to SQL operations, to create more complex datasets from different sources. 

* **Efficient Indexing**: Pandas provides efficient indexing and selection methods, allowing you to access specific rows and columns of data quickly. 

* **Custom Data Structures**: You can create custom data structures and manipulate data in ways that suit your specific needs, extending Pandas' capabilities.
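
To make the merging and joining feature above concrete, here is a minimal sketch. The `employees` and `salaries` tables and their column names are hypothetical, invented only for illustration.

```python
import pandas as pd

# Hypothetical tables, made up for this illustration
employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
salaries = pd.DataFrame({
    "emp_id": [1, 2, 4],
    "salary": [50000, 60000, 70000],
})

# Inner join: keep only the emp_id values present in both tables
inner = pd.merge(employees, salaries, on="emp_id", how="inner")
print(inner)

# Left join: keep every employee; a missing salary becomes NaN
left = employees.merge(salaries, on="emp_id", how="left")
print(left)
```

The `how` argument plays the role of the join type in SQL (`inner`, `left`, `right`, `outer`).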

---

DataFrame example showing the **ROW & COLUMN LABELS (INDEX)**.

![image.png](attachment:119e91b4-50da-47ec-9b3f-01bc4c4e2e25.png)

In [2]:
# importing pandas
import pandas as pd

# Data Loading

* Pandas can be used to load data from various sources, such as CSV and Excel files.
* The `read_csv()` function is used to load data from a CSV file into a Pandas DataFrame.
* The `read_excel()` function is used to load data into a Pandas DataFrame from various Excel file formats, including `.xls`, `.xlsx`, `.xlsm`, `.xlsb`, `.odf`, `.ods`, and `.odt`.

To read a CSV (Comma-Separated Values) file in Python using the Pandas library, you can use the `pd.read_csv()` function. 

Here's the syntax to read a CSV file:

```python
import pandas as pd

# Read the CSV file into a DataFrame.
# The file must be in the current working directory; otherwise pass the full path.
df = pd.read_csv('your_file.csv')

# Important attributes & functions
print(df.head())   # first 5 rows by default
print(df.tail())   # last 5 rows by default
print(type(df))    # <class 'pandas.core.frame.DataFrame'>
print(df.dtypes)   # data type of each column
df.info()          # prints a summary itself and returns None

```
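
If no CSV file is at hand, the same call can be tried on in-memory text. This is a small sketch that assumes made-up data; `io.StringIO` stands in for a file on disk, and the name `df_csv` is only illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical data)
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,San Francisco
"""

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.head())    # first rows of the DataFrame
print(df_csv.dtypes)    # inferred data type of each column
```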

# What is a Series?

A Series is a one-dimensional **labeled array** in Pandas. 
* It can be thought of as a **single column of data with labels or indices** for each element.
* You can create a Series from various data sources, such as lists, NumPy arrays, or dictionaries.

Here's a basic example of creating a Series in Pandas. 

> Notice that Pandas automatically assigned **numerical indices (0, 1, 2, 3, 4)** to each element, but you can also specify **custom labels**, if needed.

In [3]:
import pandas as pd

# Create a Series from a list
data = [10, 20, 30, 40, 50]

print("ORIGINAL SERIES")
s = pd.Series(data)
print(s)

ORIGINAL SERIES
0    10
1    20
2    30
3    40
4    50
dtype: int64
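
As a sketch of the custom labels mentioned above, the `index` argument of `pd.Series` sets the labels explicitly, and a dictionary supplies labels through its keys. The values and label names here are arbitrary.

```python
import pandas as pd

# Custom labels passed through the index argument
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s_labeled)

# A dictionary's keys become the labels
s_from_dict = pd.Series({"x": 1.5, "y": 2.5})
print(s_from_dict)

# Elements are then looked up by label
print(s_labeled["b"])
```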


# Accessing Elements in a Series

You can access elements in a Series using the **index labels or integer positions**. 

Here are a few common methods for accessing Series data:

In [4]:
# Access the element with label 2 (value 30)
print(s[2])     

# Access the element at position 3 (value 40)
print(s.iloc[3]) 

# Access a range of elements by position (the stop position 4 is excluded)
print(s[1:4])   

30
40
1    20
2    30
3    40
dtype: int64
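
With the default index, labels and positions happen to coincide, which can hide the difference between the two. The following sketch uses a Series with arbitrary non-default integer labels (an assumption made for illustration) and the explicit `.loc[]` and `.iloc[]` indexers.

```python
import pandas as pd

# Labels 100, 200, 300 differ from positions 0, 1, 2
s_odd = pd.Series([10, 20, 30], index=[100, 200, 300])

by_label = s_odd.loc[200]       # element whose label is 200
by_position = s_odd.iloc[0]     # element at position 0

label_slice = s_odd.loc[100:300]   # label slice: stop label is included
pos_slice = s_odd.iloc[0:2]        # position slice: stop position is excluded

print(by_label, by_position)
print(label_slice.tolist())
print(pos_slice.tolist())
```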


# Series Attributes and Methods

Pandas Series come with various attributes and methods to help you manipulate and analyze data effectively. 

Here are a few essential ones:
* `values`: Returns the Series data as a NumPy array.
* `index`: Returns the index (labels) of the Series.
* `shape`: Returns a tuple representing the dimensions of the Series.
* `size`: Returns the number of elements in the Series.
* `mean()`, `sum()`, `min()`, `max()`: Calculate summary statistics of the data.
* `unique()`, `nunique()`: Get unique values or the number of unique values.
* `sort_values()`, `sort_index()`: Sort the Series by values or index labels.
* `isnull()`, `notnull()`: Check for missing (NaN) or non-missing values.
* `apply()`: Apply a custom function to each element of the Series.
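
A short sketch exercising several of the attributes and methods listed above. The sample values, including the missing value, are arbitrary.

```python
import numpy as np
import pandas as pd

s_stats = pd.Series([3, 1, 2, 2, np.nan])

print(s_stats.shape)            # dimensions as a tuple
print(s_stats.size)             # number of elements, NaN included
print(s_stats.mean())           # NaN is skipped by default
print(s_stats.nunique())        # NaN is not counted by default
print(s_stats.isnull().sum())   # number of missing values

sorted_s = s_stats.sort_values()             # NaN is placed last by default
doubled = s_stats.apply(lambda x: x * 2)     # element-wise function
print(sorted_s)
print(doubled)
```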

# What is a DataFrame?

A **DataFrame** is a two-dimensional labeled data structure with columns of potentially different data types. 
* Think of it as a table where **each column represents a variable**, and **each row represents an observation or data point**.
* DataFrames are suitable for a wide range of data, including structured data from CSV files, Excel spreadsheets, SQL databases, and more.

# Creating DataFrames from Dictionaries

DataFrames can be created from dictionaries, with keys as column labels and values as lists holding each column's data.

In [5]:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)
print(df)


      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   28        Chicago


# Column Selection

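* You can select a single column from a DataFrame by specifying the column name within single brackets, which returns a Series. Using double brackets returns a one-column DataFrame instead.
* Multiple columns can be selected by passing a list of column names within double brackets, creating a new DataFrame.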

In [6]:
print(df['Name'])  # Access the 'Name' column
print(type(df['Name']))

print("---")
      
print(df[['Name']])  # Access the 'Name' column as a one-column DataFrame
print(type(df[['Name']]))

0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object
<class 'pandas.core.series.Series'>
---
      Name
0    Alice
1      Bob
2  Charlie
3    David
<class 'pandas.core.frame.DataFrame'>


In [7]:
print(df[['Name', 'Age']])  # Access the 'Name' and 'Age' columns

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   28


# Accessing Rows

You can access rows by **integer position** using `.iloc[]` or by **label** using `.loc[]`. Both are indexers used with square brackets, not methods called with parentheses.

**`loc[]`**:
* It's a **label-based** data selection method, which means that we have to pass the **name of the row or column label** that we want to select.
* Slices **include the last element of the range** passed in. (**Slicing use case**)
* **Syntax**: `df.loc[row_label, column_label]`

**`iloc[]`**:
* It's an **integer-position-based** data selection method, which means that we have to pass an **integer position** to select a specific row/column.
* Slices **do not include the last element of the range** passed in. (**Slicing use case**)
* **Syntax**: `df.iloc[row_index, column_index]`

In [8]:
# Access the data in the dataframe having index (0,0)
print(df.iloc[0,0])  
print()

# Access the data in the dataframe having index (0,2)
print(df.iloc[0,2])  
print()

# Access the data in the dataframe having row_label=0 & column_label='Name'
print(df.loc[0,'Name'])  


Alice

New York

Alice


In [9]:
# Copy the DataFrame (plain assignment, df_copy = df, would only create a second name for the same object)
df_copy = df.copy()
print(df_copy)

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   28        Chicago
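
Plain assignment and `copy()` behave differently: assignment binds a second name to the same DataFrame, while `copy()` creates an independent one. A minimal sketch, using hypothetical variable names and data:

```python
import pandas as pd

original = pd.DataFrame({"Age": [25, 30]})

alias = original                 # second name for the same object
independent = original.copy()    # separate DataFrame with its own data

# A change made through the alias is visible through the original name
alias.loc[0, "Age"] = 99

print(alias is original)            # both names refer to one object
print(original.loc[0, "Age"])       # changed through the alias
print(independent.loc[0, "Age"])    # the copy is unaffected
```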


In [10]:
# Setting 'Name' column values as the row labels or index
df2 = df_copy.set_index('Name')
print(df2)

print()
print(df)

         Age           City
Name                       
Alice     25       New York
Bob       30  San Francisco
Charlie   35    Los Angeles
David     28        Chicago

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   28        Chicago


In [33]:
print(df2.loc['Alice', 'Age'])

25


# Slicing

You can slice DataFrames to select specific rows and columns.

Both integer positions and labels can be used for slicing.

**Syntax**: 
* `df.loc[row_start_label : row_stop_label, column_start_label : column_stop_label]`: **includes** the stop labels of the row and column. Slice labels generally need to exist in the DataFrame; a missing label raises a `KeyError` unless the index is sorted.
* `df.iloc[row_start_index : row_stop_index, column_start_index : column_stop_index]`: **excludes** the stop index of the row and column. Slice bounds that are out of range do not raise an error; the result is simply truncated.
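
The error behavior of the two indexers can be sketched as follows. The small `df_demo` frame is hypothetical; the flags `missing_label_raised` and `bad_position_raised` exist only to record what happened.

```python
import pandas as pd

df_demo = pd.DataFrame({"Age": [25, 30, 35]},
                       index=["Alice", "Bob", "Charlie"])

# An iloc slice that runs past the end is truncated, not an error
tail_rows = df_demo.iloc[1:10]
print(tail_rows)

# Selecting a single label that does not exist raises KeyError
missing_label_raised = False
try:
    df_demo.loc["Zoe"]
except KeyError:
    missing_label_raised = True

# Selecting a single out-of-range position raises IndexError
bad_position_raised = False
try:
    df_demo.iloc[10]
except IndexError:
    bad_position_raised = True

print(missing_label_raised, bad_position_raised)
```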

In [36]:
print(df2)

         Age           City
Name                       
Alice     25       New York
Bob       30  San Francisco
Charlie   35    Los Angeles
David     28        Chicago


In [40]:
print(df2.iloc[0:2, 0:3])

       Age           City
Name                     
Alice   25       New York
Bob     30  San Francisco


In [44]:
print(df.loc[0:2, 'Age': 'City']) # df because df2 doesn't have numerical row index.

   Age           City
0   25       New York
1   30  San Francisco
2   35    Los Angeles


In [43]:
print(df2.loc['Bob':'Charlie', 'Age': 'City'])

         Age           City
Name                       
Bob       30  San Francisco
Charlie   35    Los Angeles


# Finding Unique Elements

Use the `unique()` method to get the distinct values in a column of a DataFrame, in order of first appearance.

In [19]:
unique_ages = df['Age'].unique()
print(unique_ages)

[25 30 35 28]


# Conditional Filtering

You can filter data in a DataFrame based on conditions using comparison operators. For instance, you can keep only the people older than a certain age.

In [20]:
older_than_25 = df[df['Age'] > 25]
print(older_than_25)

      Name  Age           City
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   28        Chicago
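
Conditions can also be combined. The sketch below rebuilds the same sample data under the hypothetical name `people`; note that each condition needs its own parentheses and that `&` (and) and `|` (or) are used instead of the Python keywords `and` and `or`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
})

# Both conditions must hold: older than 25 AND not living in Chicago
mask = (people["Age"] > 25) & (people["City"] != "Chicago")
filtered = people[mask]
print(filtered)

# isin() tests membership in a list of allowed values
in_cities = people[people["City"].isin(["New York", "Chicago"])]
print(in_cities)
```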


# Saving DataFrames

To save a DataFrame to a CSV file, use the **`to_csv()`** method and specify the filename with a **“.csv”** extension.

Pandas provides other functions for saving DataFrames in different formats.

```python
df.to_csv('trading_data.csv', index=False)
```
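
The effect of `index=False` can be checked without touching the disk: when no path is given, `to_csv()` returns the CSV text as a string. A small round-trip sketch with made-up data and illustrative variable names:

```python
import io
import pandas as pd

frame = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path, to_csv returns the CSV content as a string
csv_text = frame.to_csv(index=False)
print(csv_text)

# Reading the text back gives the same columns and values
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Without `index=False`, the row labels would be written as an extra unnamed first column.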

# DataFrame Attributes and Methods

DataFrames provide numerous attributes and methods for data manipulation and analysis, including:
* `shape`: Returns the dimensions (number of rows and columns) of the DataFrame.
* `info()`: Provides a summary of the DataFrame, including data types and non-null counts.
* `describe()`: Generates summary statistics for numerical columns.
* `head()`, `tail()`: Displays the first or last n rows of the DataFrame.
* `mean()`, `sum()`, `min()`, `max()`: Calculate summary statistics for columns.
* `sort_values()`: Sort the DataFrame by one or more columns.
* `groupby()`: Group data based on specific columns for aggregation.
* `fillna()`, `drop()`, `rename()`: Fill missing values, drop rows or columns, or rename row or column labels.
* `apply()`: Apply a function to each element, row, or column of the DataFrame.
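
A brief sketch combining a few of these methods. The `sales` table, its column names, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "units": [10, np.nan, 30, 20],
})

print(sales.shape)    # (rows, columns)
sales.info()          # dtypes and non-null counts

# Replace the missing value in 'units' with 0
filled = sales.fillna({"units": 0})

# Total units per region
totals = filled.groupby("region")["units"].sum()
print(totals)

# Largest 'units' first
ranked = filled.sort_values("units", ascending=False)
print(ranked)

# Rename a column; the original frame is left unchanged
renamed = filled.rename(columns={"units": "units_sold"})
print(renamed.columns.tolist())
```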

**Pandas official documentation: https://pandas.pydata.org/docs/**

# Conclusion

In conclusion, mastering the use of Pandas Series and DataFrames is essential for effective data manipulation and analysis in Python. 
* Series provide a foundation for handling one-dimensional data with labels, while DataFrames offer a versatile, table-like structure for working with two-dimensional data.
* Whether you're cleaning, exploring, transforming, or analyzing data, these Pandas data structures, along with their attributes and methods, empower you to efficiently and flexibly manipulate data to derive valuable insights.
* By incorporating Series and DataFrames into your data science toolkit, you'll be well-prepared to tackle a wide range of data-related tasks and enhance your data analysis capabilities.