# Pandas Basics

> Pandas is a widely-used Python library for data manipulation and analysis, centred around tabular data structures. It offers powerful and flexible data structures, such as `DataFrame`s and `Series`, to enable efficient data wrangling, cleaning, and analysis.

## Key Features of Pandas

Pandas has a wide range of features for viewing and manipulating tabular data:

- **Data Exploration and Analysis**: It provides functions for calculating descriptive statistics, summarising data, and identifying trends or patterns
- **Graphing and Visualization**: It has integrations with libraries like Matplotlib for creating informative plots and charts directly from `DataFrame`s
- **Data Cleaning and Preparation**: Tools for handling missing data, detecting outliers, and preparing datasets for analysis or machine learning models
- **Data Transformation**: Capabilities for reshaping, pivoting, and transforming datasets to suit specific analysis needs
- **File Compatibility and I/O**: Efficiently reads and writes to a variety of file formats, such as CSV, Excel, and SQL databases, facilitating easy data import and export

## Practical Applications

Pandas is used in various applications within data science and related fields:

- **Data Cleaning and Preparation**: It streamlines the process of cleaning raw data, handling missing values, and preparing datasets for analysis
- **Exploratory Data Analysis (EDA)**: Offers tools for deep-diving into datasets, uncovering insights, and informing subsequent analysis or modelling decisions
- **Data Transformation and Aggregation**: Essential for reshaping data, creating summary tables, and performing group-wise operations for data analysis
- **Data Visualisation**: Aids in creating meaningful visual representations of data, crucial for reporting and storytelling in data science

Pandas' functionality is not only robust but also relatively user-friendly, making it a go-to tool for data professionals seeking to perform efficient and effective data analysis.

## Installation

Pandas can be installed directly via `pip` using the `pip install pandas` command. Alternatively, you can create a Conda environment and then run `conda install pandas` inside the activated environment.

You can check that Pandas is installed correctly using the following command in your CLI:

`python -c "import pandas as pd; print(pd.__version__)"`

This should return the version of Pandas that you have installed.

## Creating Your First Pandas Project

For most operations, Pandas is best used inside a Jupyter Notebook. Notebooks allow you to run individual code blocks separately, and render Pandas outputs in a more human-readable format.

1. Create a new notebook by typing `touch pandas_demo.ipynb` in the CLI
2. Inside the notebook, write `import pandas as pd`. Pandas is typically imported using the `pd` alias for the sake of brevity.



## Pandas Data Structures

> Pandas introduces two primary data structures: `Series` and `DataFrame`. A `Series` is a one-dimensional array-like object, akin to a column in a spreadsheet, while a `DataFrame` is a two-dimensional table with rows and columns, similar to a whole spreadsheet. Both are built on top of NumPy's `ndarray`, a multidimensional array type intended for numerical data. Pandas extends it with more functionality and support for a wider variety of data types. A `Series` suits one-dimensional data, whereas a `DataFrame` suits tabular data with multiple columns.

Each of these data structures has a wide range of built-in methods that allow you to perform common data tasks quickly and efficiently.


### Pandas `DataFrame`

> A `DataFrame` (often abbreviated to `df`) is like a table or an Excel spreadsheet. It's a 2-dimensional array used to store and work with data. It has rows and columns, where each column can have a different type of data, like numbers or text. It's great for analysing and organising tabular data. Like other Python data structures, a `DataFrame` is an object with a range of associated methods, and it is these methods that make the `DataFrame` a powerful tool for analysis.

Let's take a look at an example `DataFrame`. We will define a simple Python dictionary and then convert it into a `DataFrame` using the `pd.DataFrame()` constructor. You can then view the first few rows using the `DataFrame`'s `.head()` method, optionally passing a number of rows as an argument:

In [None]:
import pandas as pd

# Example dictionary
data = {
    'Name': ['Alice', 'Belkis', 'Carlotta'],
    'Age': [25, 60, 35],
    'City': ['New York', 'Addis Ababa', 'Panama City']
}

# Creating a DataFrame from the dictionary
example_df = pd.DataFrame(data)
example_df.head(5)

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 25 | New York |
| 1 | Belkis | 60 | Addis Ababa |
| 2 | Carlotta | 35 | Panama City |


We can see that the `pd.DataFrame` constructor has taken our dictionary and produced a table with the keys as column names and the value lists as the data for each column. It has also added an extra column on the left. This is known as the `index` and is used to identify a specific row.
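
To make the structure described above concrete, here is a minimal sketch that rebuilds the same `DataFrame` and inspects its shape, column names, index, and column types. It only uses standard `DataFrame` attributes.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Belkis', 'Carlotta'],
    'Age': [25, 60, 35],
    'City': ['New York', 'Addis Ababa', 'Panama City'],
}
example_df = pd.DataFrame(data)

print(example_df.shape)             # (number of rows, number of columns)
print(example_df.columns.tolist())  # column names, taken from the dictionary keys
print(example_df.index.tolist())    # row labels, generated automatically
print(example_df.dtypes)            # one data type per column
```
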
### Pandas `Series`

> A `Series` is like a single column of a `DataFrame`: a 1-dimensional array that can hold data of any type.

We can create a `Series` from a list as follows:

In [None]:
example_series = pd.Series([1, 2, 3])
example_series

0    1
1    2
2    3
dtype: int64
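
A `Series` can also carry its own labels and a name. The sketch below is illustrative only: the variable `ages` and its label values are made up for this example.

```python
import pandas as pd

# Hypothetical data: ages labelled by name
ages = pd.Series([25, 60, 35], index=['Alice', 'Belkis', 'Carlotta'], name='Age')

print(ages.index.tolist())  # the label attached to each value
print(ages.name)            # the name of the Series
print(ages['Belkis'])       # look a value up by its label
```
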

## Indexing and Slicing Pandas Data Structures

> In Pandas, indexing and slicing enable precise data selection and manipulation within `Series` and `DataFrame`s. This functionality is akin to accessing elements in Python lists or arrays, but with enhanced capabilities. Indexing allows for selecting specific rows or columns using labels or positions, while slicing facilitates retrieving subsets of data.

With the default integer index, a `Series` can be indexed much like a Python list:

In [None]:
example_series[0]

1
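
The list-like behaviour depends on the index. As a rough sketch (the variable names here are assumptions for illustration), using `.iloc` for positions and `.loc` for labels keeps the intent explicit whatever the index looks like:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

first_two = s.iloc[:2]   # positions 0 and 1, like a list slice
last_item = s.iloc[-1]   # negative positions count from the end

# With non-integer labels, .loc selects by label and .iloc by position
labelled = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
by_label = labelled.loc['y']
by_position = labelled.iloc[1]
```
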

Indexing and slicing of `DataFrame` objects can be performed in multiple ways. We can select a column using its name in square brackets `[]`, and it will be extracted as a `Series` object:

In [None]:
single_column = example_df['Name']
print(type(single_column))
single_column.head()

<class 'pandas.core.series.Series'>


0       Alice
1      Belkis
2    Carlotta
Name: Name, dtype: object

Alternatively, we can select a column (or columns) in double square brackets, and it will be extracted as a `DataFrame` object:

In [None]:
single_column = example_df[['Name']]
print(type(single_column))
single_column.head()

<class 'pandas.core.frame.DataFrame'>


|   | Name |
|---|------|
| 0 | Alice |
| 1 | Belkis |
| 2 | Carlotta |


Finally, it is also possible to select a column as a series using `.` notation:

In [None]:
type(example_df.Name)

pandas.core.series.Series
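
The three selection styles differ in what they can express. The sketch below uses a hypothetical `Home City` column (not part of the example above) to show selecting several columns at once, and a column name that dot notation cannot reach:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Belkis'],
    'Home City': ['New York', 'Addis Ababa'],  # hypothetical column name containing a space
    'Age': [25, 60],
})

subset = df[['Name', 'Age']]  # a list of names returns a DataFrame with those columns
city = df['Home City']        # bracket notation works for any column name
# df.Home City                # dot notation cannot express this name (SyntaxError)
```

Bracket notation is the safer default, since dot notation only works when the column name is a valid Python identifier that does not clash with an existing `DataFrame` attribute.
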

### The `.loc` Attribute
We can index a specific row of a `DataFrame` using the locate row (`.loc`) attribute. This selects rows based on the values in the index column. Note the syntax: just use the row index in square brackets `[]`, no parentheses `()`.

In [None]:
example_df.loc[0]

Name       Alice
Age           25
City    New York
Name: 0, dtype: object

You can also select multiple rows simultaneously by supplying a list as an argument:

In [None]:
example_df.loc[[0,2]]

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 25 | New York |
| 2 | Carlotta | 35 | Panama City |
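
Besides single labels and lists of labels, `.loc` also accepts a boolean mask. This is a minimal sketch on a small stand-in `DataFrame`; the variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Belkis', 'Carlotta'], 'Age': [25, 60, 35]})

# The comparison produces a Series of True/False values, one per row
mask = df['Age'] > 30

# .loc keeps only the rows where the mask is True
over_30 = df.loc[mask]
print(over_30)
```
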


### The `.iloc` Attribute
Currently our index values are the default integers `0`, `1`, `2`, and so on, which match each row's zero-based position. However, we can also use any other list of unique values as our index:

In [None]:
example_df.index = ['a', 'b', 'c']
example_df.head()

|   | Name | Age | City |
|---|------|-----|------|
| a | Alice | 25 | New York |
| b | Belkis | 60 | Addis Ababa |
| c | Carlotta | 35 | Panama City |


The `loc` attribute is for **label-based indexing**. Using `loc` will work as before with the new index values:

In [None]:
example_df.loc['a']

Name       Alice
Age           25
City    New York
Name: a, dtype: object

But the index labels no longer match each row's integer position. In this situation we can use the `iloc` attribute for **position-based indexing** instead:


In [None]:
example_df.iloc[0]

Name       Alice
Age           25
City    New York
Name: a, dtype: object

### Slicing Rows and Columns Together

As well as selecting individual rows and columns, we can index or slice by both at once:


#### Using `loc`:


In [None]:
example_df.loc['a',['Name','Age']]

Name    Alice
Age        25
Name: a, dtype: object

#### Using `iloc`:

In [None]:
example_df.iloc[0,[0,1]]

Name    Alice
Age        25
Name: a, dtype: object
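
Both attributes also accept slices for rows and columns. One difference worth knowing is that label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. A short sketch, rebuilding the example data with the letter index:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Name': ['Alice', 'Belkis', 'Carlotta'],
        'Age': [25, 60, 35],
        'City': ['New York', 'Addis Ababa', 'Panama City'],
    },
    index=['a', 'b', 'c'],
)

by_label = df.loc['a':'b', 'Name':'Age']  # label slices include both endpoints
by_position = df.iloc[0:2, 0:2]           # position slices exclude the end position

# Both selections cover rows a-b and columns Name-Age
print(by_label.equals(by_position))
```
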

## Resetting the Index

> Resetting the index of a Pandas `DataFrame` is often used after performing data manipulations like slicing or filtering, which can leave the `DataFrame` with an index that is non-sequential or not aligned with the data's current state. 

### Why Reset the Index?

- **Non-Sequential Indices:** After slicing or filtering a `DataFrame`, the resulting index might be non-sequential or non-contiguous, which can be confusing and may cause issues with data alignment or further data manipulation
- **Alignment and Consistency:** Resetting the index ensures that the `DataFrame` maintains a consistent structure, with a sequential numeric index starting from 0
- **Ease of Merging and Joining:** A standard, sequential index is often easier to work with when performing database-style join or merge operations


Let's consider a `DataFrame` and perform a slicing operation to filter out certain rows. We'll then see the need to reset the index.

In [None]:
import pandas as pd

# Example DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 35, 29, 40]
}
df = pd.DataFrame(data)

# Slicing the DataFrame to select specific rows
sliced_df = df[df['Age'] > 30]

sliced_df.head()

|   | Name | Age |
|---|------|-----|
| 1 | Bob | 35 |
| 3 | Diana | 40 |


Our filtering operation has made the index column non-contiguous, so we should now apply `.reset_index()`:

In [None]:
reset_df = sliced_df.reset_index(drop=True) # drop=True prevents the old index from being added as an extra column
reset_df.head()

|   | Name | Age |
|---|------|-----|
| 0 | Bob | 35 |
| 1 | Diana | 40 |
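
To see what `drop=True` changes, this sketch repeats the filtering step and resets the index both ways. Without `drop=True`, the old index is kept as a new column named `index`.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Diana'], 'Age': [25, 35, 29, 40]})
filtered = df[df['Age'] > 30]

kept = filtered.reset_index()              # old index becomes a column called 'index'
dropped = filtered.reset_index(drop=True)  # old index is discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
```
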


## Importing Data into Pandas 

> Pandas provides functions for importing data from various sources and loading it into a `DataFrame`. There are functions to read data from sources like `CSV`, Excel, and `JSON` files, and from SQL databases.

### Creating a `DataFrame` from Python Objects

As we saw earlier in the lesson, a `DataFrame` can be created from a dictionary of lists, where the keys are the column headings and the values are `lists` of column values.


In [None]:
# Create a df from a dictionary, where each key is a column name and each value is a list of column values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 35, 29, 40]
}
names_df = pd.DataFrame(data)
names_df.head()

|   | Name | Age |
|---|------|-----|
| 0 | Alice | 25 |
| 1 | Bob | 35 |
| 2 | Charlie | 29 |
| 3 | Diana | 40 |



You can also make a `DataFrame` from a list of lists, in which case you should supply the column names using the `columns` parameter.

By default, each list will represent a row of the `DataFrame`:

In [None]:
## Creating a DF from a list of lists, with each list representing a row

list_1 = [1,"Jan"]
list_2 = [2, "feb"]

month_df = pd.DataFrame([list_1, list_2], columns=["number", "name"])
month_df.head()




|   | number | name |
|---|--------|------|
| 0 | 1 | Jan |
| 1 | 2 | feb |


To create from a list of lists where each list is a column, you can use Python's `zip` function:

In [None]:
## Creating a DF from a list of lists, with each list representing a column

data = [
    [1, 2, 3],
    ['Alice', 'Bob', 'Charlie'],
    [30, 25, 35]
]

# Create a DataFrame with each list as a column
employee_df = pd.DataFrame(list(zip(*data)), columns=['id', 'name', 'age'])
employee_df.head()

|   | id | name | age |
|---|----|------|-----|
| 0 | 1 | Alice | 30 |
| 1 | 2 | Bob | 25 |
| 2 | 3 | Charlie | 35 |
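
An alternative sketch for column-wise lists, offered as one option rather than the only approach: pair each column name with its list using `zip`, turn the pairs into a dictionary, and pass that to `pd.DataFrame`.

```python
import pandas as pd

columns = ['id', 'name', 'age']
data = [
    [1, 2, 3],
    ['Alice', 'Bob', 'Charlie'],
    [30, 25, 35],
]

# Pair each column name with its list of values, then build the DataFrame from that dictionary
employee_df = pd.DataFrame(dict(zip(columns, data)))
print(employee_df)
```

This avoids transposing the data, because the dictionary form already treats each list as a column.
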


### Importing from `CSV`

One of the most common file types to read into Pandas is comma separated values (`CSV`). You can import data from a `CSV` file using the `pd.read_csv` function:

In [None]:
salaries = pd.read_csv('https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/Salaries.csv')
salaries.head()

|   | Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status |
|---|----|--------------|----------|---------|-------------|----------|----------|----------|------------------|------|-------|--------|--------|
| 0 | AA | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.0 | 400184.25 |  | 567595.43 | 567595.43 | 2011 |  | San Francisco |  |
| 1 | AB | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966.02 | 245131.88 | 137811.38 |  | 538909.28 | 538909.28 | 2011 |  | San Francisco |  |
| 2 | AC | ALBERT PARDINI | CAPTAIN III (POLICE DEPARTMENT) | 212739.13 | 106088.18 | 16452.6 |  | 335279.91 | 335279.91 | 2011 |  | San Francisco |  |
| 3 | AD | CHRISTOPHER CHONG | WIRE ROPE CABLE MAINTENANCE MECHANIC | 77916.0 | 56120.71 | 198306.9 |  | 332343.61 | 332343.61 | 2011 |  | San Francisco |  |
| 4 | AE | PATRICK GARDNER | DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) | 134401.6 | 9737.0 | 182234.59 |  | 326373.19 | 326373.19 | 2011 |  | San Francisco |  |


#### Optional Parameters

- **`sep`**: Defines the delimiter to use. The default is `,`, and you can check your `CSV` file to see which should be used 
  
- **`header`**: Indicates the row number to use as column names (0-indexed). Default is `0` (first line), but can be set to `None`, in case you have no column names in your first row.
  
- **`index_col`**: This parameter is used to specify which column should be used as the row index. It can be an integer (column position) or a string (column label).
  
- **`usecols`**: Useful when you want to load only specific columns. Pass a list of column names or numbers.
  
- **`dtype`**: Sets the data type of the columns. Pass a single type to apply it to every column, or a dictionary mapping column names to types, e.g. `dtype={'Age': 'int64'}`

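
The parameters above can be combined in one call. The sketch below reads from an in-memory string instead of a file so that it runs on its own; the data and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data standing in for a file on disk
raw = io.StringIO("id;name;age;city\n1;Alice;30;Leeds\n2;Bob;25;Lagos\n")

people = pd.read_csv(
    raw,
    sep=';',                        # delimiter used in the data
    header=0,                       # the first line holds the column names
    index_col='id',                 # use the 'id' column as the row index
    usecols=['id', 'name', 'age'],  # skip the 'city' column
    dtype={'age': 'float64'},       # dictionary mapping column name to type
)
print(people)
```
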
#### Handling Common `CSV` Loading Issues

A number of factors can affect the proper reading of `CSV` files. Here are some example issues and how to solve them:


- **Different Delimiters:** `CSV` files may use delimiters other than commas (like tabs or semicolons). This will cause the DataFrame to be formatted incorrectly, or to throw an error during loading. Use the `sep` parameter to specify the delimiter, e.g. `pd.read_csv('file.csv', sep='\t')` for tab-delimited files.
  
- **Missing Headers:** If a `CSV` file doesn’t have a header row, set `header=None` to prevent the first row from being treated as column names. You can then assign column names using the `names` parameter.
  
- **Encoding Issues:** `CSV` files can have different encodings (like `UTF-8`, `Latin1`). If you encounter encoding errors, use the `encoding` parameter, e.g., `pd.read_csv('file.csv', encoding='latin1')`.
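
The fixes above can be tried without any files by reading from in-memory buffers. This is a sketch under that assumption: the data is made up, and `io.StringIO` and `io.BytesIO` stand in for real `CSV` files.

```python
import io
import pandas as pd

# Hypothetical tab-delimited data with no header row
raw = io.StringIO("Alice\t30\nBob\t25\n")
people = pd.read_csv(raw, sep='\t', header=None, names=['name', 'age'])

# Hypothetical Latin-1 encoded bytes; decoding them as UTF-8 would fail on the 'ë'
latin_bytes = io.BytesIO('name\nZoë\n'.encode('latin1'))
names = pd.read_csv(latin_bytes, encoding='latin1')

print(people)
print(names)
```
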
  
### Importing from Excel

Pandas provides the `pd.read_excel()` function to read Excel files. Here is an example load statement, using the first column as the index:




In [None]:
excel_df = pd.read_excel('example.xlsx', index_col=0)
excel_df.head()


| index | name | age | salary |
|-------|------|-----|--------|
| 0 | Aarav Patel | 39 | 50000 |
| 1 | Ying Chen | 23 | 45000 |
| 2 | John Smith | 48 | 46500 |
| 3 | Maria Garcia | 22 | 35000 |
| 4 | Fatima Zahra | 38 | 62000 |


#### Optional Parameters

- `sheet_name`: Specifies which sheet to read. By default, it's set to `0`, meaning the first sheet. You can set it to the sheet's name or index.
  
- `header`: Identifies the row to use as column names
  
- `index_col`: Selects a column to use as the row index
  
- `usecols`: Limits the import to a list of specific columns

### Outputting `DataFrame` Data to a File

Pandas `DataFrame` objects have various methods for writing data to different file types.

For example, to write data to a `CSV` file, you can use the `to_csv` method:


In [None]:
example_df.to_csv('test.csv', index=False) # index=False is used to avoid saving the index column.

The `index=False` parameter is used to prevent the index column getting saved with the rest of the data. Otherwise each time you load and save the file, Pandas will add an extra index column.

Similar to the `to_csv()` method, you can output the data to a `JSON` file using the `to_json()` method.
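
The effect of `index=False` is easiest to see in a round trip. The sketch below writes to in-memory buffers rather than files, which is an assumption made so that the example is self-contained; it also calls `to_json()` without a path, in which case the method returns a string.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 35]})

# Save WITH the index, then reload: the index comes back as an extra unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# Save WITHOUT the index, then reload: the columns match the original
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(reloaded.columns.tolist())
print(clean.columns.tolist())

# With no path argument, to_json returns the JSON text as a string
json_text = df.to_json(orient='records')
print(json_text)
```
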

### Exporting a `DataFrame` to a Dictionary

Another option is to export the `DataFrame` to a Python `dict`, which can then be saved in any format that supports that data structure, such as a `YAML` or `pickle` file. This is achieved with the `to_dict()` method as follows:

In [None]:
sample_dict = example_df.to_dict()
print(type(sample_dict))
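
The shape of the resulting dictionary depends on the `orient` argument. A brief sketch on a small stand-in `DataFrame`, comparing the default with two common alternatives:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 35]})

by_column = df.to_dict()               # {column: {index: value}}
by_row = df.to_dict(orient='records')  # list with one dictionary per row
as_lists = df.to_dict(orient='list')   # {column: list of values}

print(by_column)
print(by_row)
print(as_lists)
```
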

## `DataFrame` Copies and Aliases

> It is important to understand that in Pandas, when you assign a `DataFrame` to a new variable, you are creating an **alias**, not a **copy**. This means both variables refer to the same underlying data.

In the code block below, you can see that if you assign a new variable name to a `DataFrame` and then change something in the new variable, you are changing the same underlying `DataFrame` as if you ran the manipulation on the old variable name. This is because both are **aliases** of the same object.

In [None]:
# Create a simple DataFrame
import pandas as pd
my_dict = {'Animal': ['Dog', 'Cat', 'Bird'], 'Age': [2, 4, 1]}
my_df = pd.DataFrame(my_dict)

# Assign the DataFrame to a new variable, then change a value:
new_df = my_df
new_df.iloc[0, 0] = 'Pig'

# Print the original DataFrame
print(my_df)


  Animal  Age
0    Pig    2
1    Cat    4
2   Bird    1


### Creating a Copy

Most operations you perform on a `DataFrame`, however, return a new `DataFrame` rather than modifying the original. For example:

In [None]:
my_new_df = my_df.sort_values(by='Age', ascending=False)
my_new_df.head() # The new DataFrame has been re-sorted

|   | Animal | Age |
|---|--------|-----|
| 1 | Cat | 4 |
| 0 | Pig | 2 |
| 2 | Bird | 1 |


In [None]:
my_df.head() # The original DataFrame has not been changed

|   | Animal | Age |
|---|--------|-----|
| 0 | Pig | 2 |
| 1 | Cat | 4 |
| 2 | Bird | 1 |


You can also make an unchanged copy of the `DataFrame` using the `.copy()` method:

In [None]:
my_new_df = my_df.copy()
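
To check the difference between an alias and a copy, the sketch below creates both and then edits only the copy. The variable names are chosen for illustration.

```python
import pandas as pd

original = pd.DataFrame({'Animal': ['Dog', 'Cat', 'Bird'], 'Age': [2, 4, 1]})

alias = original             # same object under a second name
duplicate = original.copy()  # independent copy of the data

duplicate.iloc[0, 0] = 'Pig'  # only the copy changes

print(alias is original)      # the alias is the very same object
print(duplicate is original)  # the copy is a different object
print(original.iloc[0, 0])    # the original value is untouched
```
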

### The `inplace` Parameter

> For some `DataFrame` operations, it is possible to apply the change to the existing `DataFrame` without an explicit assignment, by passing the argument `inplace=True`. The method then modifies the original `DataFrame` directly and returns `None`, rather than returning a new `DataFrame` with the changes applied.

Unfortunately, the implementation of this parameter is somewhat inconsistent across the library, so you might need to experiment, or refer to the documentation, to learn which methods will allow you to do this.

In [None]:
my_new_df.drop('Age', axis=1, inplace=True) # Drops the 'Age' column in-place
my_new_df.head()

|   | Animal |
|---|--------|
| 0 | Pig |
| 1 | Cat |
| 2 | Bird |
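
For comparison, here is a small sketch of the two styles side by side on a stand-in `DataFrame`. Note that an in-place call returns `None`, so its result should not be assigned back to the variable.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Dog', 'Cat'], 'Age': [2, 4]})

# Assignment style: drop returns a new DataFrame and leaves df unchanged
reassigned = df.drop(columns='Age')
print(df.columns.tolist())

# In-place style: df itself is modified and the call returns None
result = df.drop(columns='Age', inplace=True)
print(df.columns.tolist())
print(result)
```
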


## Key Takeaways

- Pandas is a Python library for analysing and manipulating large tabular datasets
- Use Pandas inside a Jupyter Notebook for EDA and data cleaning
- A `DataFrame` is a 2D array-like table data structure
- A `Series` is a 1-dimensional array in Pandas that can hold any type of data
- A `Series` with the default integer index can be indexed like a Python list
- Use `loc` for label-based indexing and `iloc` for position-based indexing
- Pandas allows indexing or slicing of both rows and columns simultaneously using `loc` and `iloc`
- Resetting the index of a `DataFrame` is useful after data manipulations like slicing or filtering
- Pandas allows importing data from various sources like `CSV`, Excel, `JSON`, and SQL into a `DataFrame`
- Pandas can read `CSV` files using `pd.read_csv()`, with optional parameters to handle issues like different delimiters, missing headers, and encoding issues
- The `pd.read_excel()` function reads Excel files, with options to specify sheet, header, index column, and specific columns
- The `to_csv()`, `to_json()` and `to_dict()` methods can be used to export data from a `DataFrame`