# Pandas

Pandas is an open-source data analysis and manipulation library for Python. It provides data structures and tools for working with structured (labeled, tabular) data. Its two primary data structures are:

- `Series`: A one-dimensional array-like object that can hold any data type such as integers, strings, floating-point numbers, and Python objects.

- `DataFrame`: A two-dimensional table-like data structure that consists of rows and columns. It can be thought of as a spreadsheet or SQL table.

In addition to these data structures, Pandas also provides various tools for data manipulation, such as merging, grouping, and reshaping data.
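
As a rough illustration of these building blocks, the sketch below creates a `Series`, then merges and groups two small DataFrames. The data and the names `people`, `cities`, and `city_id` are made up for this example.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels
ages = pd.Series([20, 30, 40], index=['John', 'Harry', 'Alice'], name='age')

# Two small DataFrames (illustrative data only)
people = pd.DataFrame({'name': ['John', 'Harry', 'Alice'], 'city_id': [1, 2, 1]})
cities = pd.DataFrame({'city_id': [1, 2], 'city': ['Paris', 'Rome']})

# Merging: join the two tables on their shared 'city_id' column
merged = people.merge(cities, on='city_id')

# Grouping: count the people in each city
per_city = merged.groupby('city')['name'].count()

print(ages['Harry'])
print(per_city)
```

Reshaping (for example with `pivot` or `melt`) follows the same pattern of calling a method on a DataFrame and getting a new DataFrame back.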

## Installing Pandas

Before we get started, make sure that Pandas is installed. You can install it with `pip` by running the following command in your terminal (or directly in a notebook cell):

```bash
pip3 install pandas
```

## Creating a DataFrame

To create a DataFrame, you can use the `pd.DataFrame()` function and pass in your data as a Python dictionary or a list of lists.

In [29]:
# Import pandas library
import pandas as pd

# Create a DataFrame from a dictionary
data = {'name': ['John', 'Harry', 'Alice'], 'age': [20, 30, 40]}
df = pd.DataFrame(data)
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


    name  age
0   John   20
1  Harry   30
2  Alice   40


In [30]:
# Create a DataFrame from a list of lists 
data = [['John', 20], ['Harry', 30], ['Alice', 40]]
df_list = pd.DataFrame(data, columns=['name', 'age'])
print(type(df_list))
df_list

<class 'pandas.core.frame.DataFrame'>


    name  age
0   John   20
1  Harry   30
2  Alice   40


## Accessing Data

You can access data in a DataFrame using various methods. Here are some common ones:

### Indexing

- `df[col_name]`: Access a single column by column name. Return-Type: Series.
- `df[[col1, col2, ...]]`: Access multiple columns by column name. Return-Type: DataFrame.

In [31]:
df['name']      # Access a single column

0     John
1    Harry
2    Alice
Name: name, dtype: object

In [32]:
df[['name', 'age']]     # Access multiple columns

    name  age
0   John   20
1  Harry   30
2  Alice   40


### Condition-based Filtering

- `df[condition]`: Filter rows based on the given condition. The condition is a boolean Series, and only rows where it is `True` are kept.

In [33]:
# Select rows where age is greater than 25
df[df['age']>25]

    name  age
1  Harry   30
2  Alice   40


In [34]:
# Select rows whose name appears in the given list
df[df['name'].isin(['Alice', 'Happy'])]

    name  age
2  Alice   40


In [35]:
# Select rows where name begins with 'A' and age is above 23
df[df['name'].str.startswith('A') & (df['age']>23)]

    name  age
2  Alice   40
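
The filters above all work the same way: the expression inside the brackets evaluates to a boolean Series (a mask). The sketch below makes that mask explicit and shows how masks might be combined with `|` (or) and `~` (not), in addition to `&`. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Harry', 'Alice'], 'age': [20, 30, 40]})

# The condition itself is a boolean Series, one value per row
mask = df['age'] > 25
print(mask.tolist())

# Combine masks with | (or) and ~ (not); wrap each comparison in parentheses
young_or_alice = df[(df['age'] < 25) | (df['name'] == 'Alice')]
not_harry = df[~(df['name'] == 'Harry')]

print(young_or_alice)
print(not_harry)
```

The parentheses matter because `&`, `|`, and `~` bind more tightly than comparison operators such as `>` and `==` in Python.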


### Methods
In Pandas, there are three main indexers for accessing data in a DataFrame: `loc`, `iloc`, and `at`.

- `df.loc[]`: Access data in a DataFrame using labels. It takes the row label(s) and, optionally, the column label(s), and returns the matching subset of the original DataFrame (or a single value when both are scalars).

- `df.iloc[]`: Access data in a DataFrame using integer positions. It takes the row position(s) and, optionally, the column position(s), and returns the matching subset of the original DataFrame (or a single value when both are scalars).

- `df.at[]`: Access a single cell in a DataFrame using labels. It is similar to `df.loc`, but is optimized for accessing a single cell.

In [45]:
print(df.loc[0],                    # Access a single row by label
      df.loc[0, 'name'],            # Access a single cell by label
      df.loc[0:1],                  # Access multiple rows (end label included)
      df.loc[1:2, ['name', 'age']], # Access multiple rows and columns
      sep='\n\n')

name    John
age       20
Name: 0, dtype: object

John

    name  age
0   John   20
1  Harry   30

    name  age
1  Harry   30
2  Alice   40


In [47]:
print(df.iloc[0],                 # Access a single row by position
      df.iloc[0, 0],              # Access a single cell by position
      df.iloc[0:3],               # Access multiple rows (end position excluded)
      df.iloc[1:3, [0, 1]],       # Access multiple rows and columns (positions 0 and 1 are 'name' and 'age')
      sep='\n\n')

name    John
age       20
Name: 0, dtype: object

John

    name  age
0   John   20
1  Harry   30
2  Alice   40

    name  age
1  Harry   30
2  Alice   40


In [53]:
df.at[0, 'name']  # Access the 'name' cell in the first row

'John'

> Note:
> - `.iloc` is exclusive of the end index, unlike `.loc`
> - `.at` is the fastest accessor for accessing a single value
> - If you try to use slicing or multiple access patterns with `.at`, you'll receive an error
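
The points in the note can be checked with a small sketch like the one below. It reuses the same three-row DataFrame. The exact exception raised by `.at` with a slice depends on the Pandas and Python versions, so the example catches a generic `Exception` rather than assuming a specific type.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Harry', 'Alice'], 'age': [20, 30, 40]})

rows_loc = df.loc[0:1]     # label slice: includes the end label 1
rows_iloc = df.iloc[0:1]   # position slice: excludes position 1
print(len(rows_loc), len(rows_iloc))

# .at only accepts a single row label and a single column label
try:
    df.at[0:1, 'name']
    at_slice_failed = False
except Exception as exc:   # exception type varies across versions
    at_slice_failed = True
    print(type(exc).__name__)
```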

## Reading Data

Pandas can read data from a variety of sources, including CSV files, Excel spreadsheets, SQL databases, and more. Here are some common ways to read data using Pandas:

- `pd.read_csv(file_path, delimiter=',')`: Reads a CSV (Comma-Separated Values) file and returns a Pandas DataFrame
    - *file_path*: Path to the CSV file
    - *delimiter*: Character separating the columns (default=',')

- `pd.read_excel(file_path, sheet_name=0)`: Reads an Excel file and returns a Pandas DataFrame
    - *file_path*: Path to the Excel file
    - *sheet_name*: Name or index of the sheet to read, or a list of names/indexes (default=0)

- `pd.read_sql(query, conn)`: Reads the result of a SQL query into a Pandas DataFrame. To do this, you first need to establish a connection to the database using a database driver such as `sqlite3` or `pymysql`.
    - *query*: SQL query
    - *conn*: Connection to the SQL database

In [57]:
import pandas as pd

path = "../../assets/Datasets/House-Price.csv"
data = pd.read_csv(path)
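
For a version that runs without any files on disk, the sketch below reads CSV text from an in-memory buffer and queries an in-memory SQLite database. The table name `people` and the sample rows are assumptions made for this example.

```python
import io
import sqlite3

import pandas as pd

# read_csv accepts file-like objects as well as paths; here the columns
# are separated by ';' so the delimiter is passed explicitly
csv_text = "name;age\nJohn;20\nHarry;30\n"
df_csv = pd.read_csv(io.StringIO(csv_text), delimiter=';')

# Build a throwaway SQLite database in memory (hypothetical 'people' table)
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [('John', 20), ('Harry', 30)])
conn.commit()

# read_sql runs the query over the connection and returns a DataFrame
df_sql = pd.read_sql("SELECT name, age FROM people WHERE age > 25", conn)
conn.close()

print(df_csv)
print(df_sql)
```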

## Writing Data

Similar to reading data from files, pandas provides several methods to write data to various file formats such as CSV, Excel, JSON, SQL, and more.

- `df.to_csv(file_path, index)`: Writes a Pandas DataFrame to a CSV file
    - *file_path*: Path of the CSV file to write
    - *index*: Whether to write the row index as an extra column (default=True)

- `df.to_excel(file_path, index)`: Writes a Pandas DataFrame to an Excel file; *file_path* and *index* have the same meaning as above

- `df.to_sql(table_name, conn, if_exists, index)`: Writes a Pandas DataFrame to a SQL database table
    - *table_name*: Database table name to write data into
    - *conn*: Connection to the SQL database
    - *if_exists*: What to do if the table already exists. Options: {'fail', 'replace', 'append'} (default='fail')
    - *index*: Whether to write the row index as a column (default=True)

In [58]:
import pandas as pd

data = pd.DataFrame({'name': ['Harry', 'Alice', 'Bob'], 'age': [25, 30, 40]})
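data.to_csv('./../assets/Dataset/data.csv', index=False) # index=False --> don't write the row index

OSError: Cannot save file into a non-existent directory: '../data'

> Note: `to_csv` does not create missing directories. If the parent directory of the target path does not exist, Pandas raises an `OSError` like the one above, so create the directory before writing.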
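
One way to avoid a missing-directory error is to create the target folder before writing. The sketch below does this inside a temporary directory and also writes the same DataFrame to an in-memory SQLite table. The folder name `exports` and the table name `people` are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({'name': ['Harry', 'Alice', 'Bob'], 'age': [25, 30, 40]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'exports'              # hypothetical output folder
    out_dir.mkdir(parents=True, exist_ok=True)   # create it before writing
    csv_path = out_dir / 'data.csv'
    df.to_csv(csv_path, index=False)             # index=False: skip the row index
    round_trip = pd.read_csv(csv_path)           # read it back to check the result

conn = sqlite3.connect(':memory:')
df.to_sql('people', conn, if_exists='fail', index=False)    # creates the table
df.to_sql('people', conn, if_exists='append', index=False)  # adds the rows again
row_count = int(pd.read_sql("SELECT COUNT(*) AS n FROM people", conn)['n'].iloc[0])
conn.close()

print(round_trip.shape, row_count)
```

With `if_exists='fail'`, the second call would raise an error instead of appending, which is a safer default when overwriting or duplicating data would be a problem.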