# Introduction to Pandas

**Author:** ChatGPT & Tran Thu Le  
**Date:** 20/02/2023

**Abstract.** Pandas is a powerful data analysis library for Python. The two main data structures provided by pandas are the **Series** and **DataFrame** objects, which can be used to represent one-dimensional and two-dimensional data, respectively.

This tutorial covers basic operations such as:

1. Selecting
2. Filtering
3. Adding
4. Modifying
5. Grouping
6. Sorting
7. Reading and writing Excel files

Pandas has many more features, but this should give you a good foundation for starting your own data analysis projects.


## Installation
To install pandas, run the following command in a terminal:

```bash
pip install pandas
```

In Jupyter Notebook or Google Colab, prefix the command with `!` so that it runs as a shell command:

```bash
!pip install pandas
```
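
As a quick check that the installation worked, the following minimal sketch imports pandas and prints its version. The exact version string depends on your environment, so no particular output is assumed here.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the version varies by environment
print(pd.__version__)
```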

## Creating a Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of labels, called its index. Here's an example of creating a Series:

```py
import pandas as pd

s = pd.Series([1, 3, 5, -1, 6, 8])
print(s)
```

```
0    1
1    3
2    5
3   -1
4    6
5    8
dtype: int64
```
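
The example above relies on the default integer index. As a small sketch of the "array of labels" idea, the index can also be set explicitly. The labels `'a'`, `'b'`, `'c'` and the name `s_labeled` are arbitrary choices made for illustration.

```python
import pandas as pd

# Labels 'a'..'c' are hypothetical, chosen only for this illustration
s_labeled = pd.Series([1, 3, 5], index=['a', 'b', 'c'])

print(s_labeled['b'])         # look up a value by its label: 3
print(list(s_labeled.index))  # the labels themselves: ['a', 'b', 'c']
```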


## Creating a DataFrame
A DataFrame is a two-dimensional table-like data structure consisting of rows and columns. You can create a DataFrame from a variety of data sources, such as a CSV file, a SQL query, or a dictionary. Here's an example of creating a DataFrame from a dictionary:

```py
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'age': [25, 32, 18, 47],
    'city': ['New York', 'Paris', 'London', 'San Francisco']
}
df = pd.DataFrame(data)
print(df)
```

```
      name  age           city
0    Alice   25       New York
1      Bob   32          Paris
2  Charlie   18         London
3     Dave   47  San Francisco
```
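
CSV data is another common source. The sketch below parses CSV text with `pd.read_csv()`. To stay self-contained it reads from an in-memory string through `io.StringIO` instead of a file on disk; the variable names `csv_text` and `df_csv` are assumptions made for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path
csv_text = """name,age,city
Alice,25,New York
Bob,32,Paris
"""

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

With a real file, the call would simply take the path, for example `pd.read_csv('data.csv')`, where `data.csv` is a placeholder name.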


## Selecting Data
To select a single column from a DataFrame, you can use the column label as an index:

```py
names = df['name']
print(names)
```

```
0      Alice
1        Bob
2    Charlie
3       Dave
Name: name, dtype: object
```


To select multiple columns, you can pass a list of column labels:

```py
subset = df[['name', 'age']]
print(subset)
```

```
      name  age
0    Alice   25
1      Bob   32
2  Charlie   18
3     Dave   47
```


To select a row by its index label, you can use the `loc[]` indexer:

```py
row = df.loc[0]
print(row)
```

```
name       Alice
age           25
city    New York
Name: 0, dtype: object
```
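
`loc[]` also accepts a list of row labels and, as a second argument, column labels. The sketch below rebuilds the tutorial's table under the separate name `people` (an assumption, so that it can run on its own) and selects several rows and a single cell.

```python
import pandas as pd

# Same data as the tutorial's df, rebuilt so this block is self-contained
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'age': [25, 32, 18, 47],
    'city': ['New York', 'Paris', 'London', 'San Francisco'],
})

# Rows labeled 0 and 2, restricted to two columns
picked = people.loc[[0, 2], ['name', 'city']]
print(picked)

# A single cell: row label 1, column 'age'
bob_age = people.loc[1, 'age']
print(bob_age)  # 32
```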


## Filtering Data
You can filter a DataFrame by selecting rows that meet certain criteria. For example, to select all rows where the age is greater than 30, you can use the following code:

```py
subset = df[df['age'] > 30]
print(subset)
```

```
   name  age           city
1   Bob   32          Paris
3  Dave   47  San Francisco
```
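
Several criteria can be combined with `&` (and) or `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The following is a minimal sketch on a self-contained copy of the data; the names `people` and `mask` are assumptions made for illustration.

```python
import pandas as pd

# Same data as the tutorial's df, rebuilt so this block is self-contained
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'age': [25, 32, 18, 47],
    'city': ['New York', 'Paris', 'London', 'San Francisco'],
})

# Boolean mask: older than 20 AND not living in Paris
mask = (people['age'] > 20) & (people['city'] != 'Paris')
result = people[mask]
print(result)  # keeps the rows for Alice and Dave
```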



## Adding Data

You can add a new column to a `DataFrame` by assigning a `Series` to a new column label. The values are matched to rows by index label:

```py
df['new_column'] = pd.Series([1, 2, 3, 4])
print(df)
```

```
      name  age           city  new_column
0    Alice   25       New York           1
1      Bob   32          Paris           2
2  Charlie   18         London           3
3     Dave   47  San Francisco           4
```
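
One detail worth knowing: a `Series` assigned as a column is aligned on its index, not on its position. The sketch below uses a small hypothetical table (`people`, `scores`, and the column names are assumptions) to show that rows without a matching label receive `NaN`, whereas a plain list is assigned by position.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']})

# This Series only has labels 0 and 2, so row 1 has no matching value
scores = pd.Series([10, 20], index=[0, 2])
people['score'] = scores

# A plain list is assigned by position and must match the number of rows
people['rank'] = [1, 2, 3]

print(people)  # the 'score' value for Bob (row 1) is NaN
```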


## Modifying Data

To modify an existing column, you can use the column label and assign a new value to it:

```py
df['age'] = df['age'] + 1
print(df)
```

```
      name  age           city  new_column
0    Alice   26       New York           1
1      Bob   33          Paris           2
2  Charlie   19         London           3
3     Dave   48  San Francisco           4
```


## Grouping Data
You can group a DataFrame by one or more columns and apply an aggregate function to each group. For example, to calculate the average age for each city in our DataFrame, we can use the following code:

```py
grouped = df.groupby('city')['age'].mean()
print(grouped)
```

```
city
London           19.0
New York         26.0
Paris            33.0
San Francisco    48.0
Name: age, dtype: float64
```
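
In the table above every city appears only once, so each group contains a single row. The sketch below uses a hypothetical variant of the data in which cities repeat (an assumption made for illustration) and computes several aggregates at once with `agg()`.

```python
import pandas as pd

# Hypothetical variant of the tutorial's table, with repeated cities
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'age': [25, 35, 18, 48],
    'city': ['Paris', 'Paris', 'London', 'London'],
})

# One row per city, one column per aggregate function
summary = people.groupby('city')['age'].agg(['mean', 'min', 'max'])
print(summary)
```

The result is a DataFrame indexed by city, so a single value can be read with, for example, `summary.loc['Paris', 'mean']`.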


## Sorting Data
You can sort a DataFrame by one or more columns using the `sort_values()` method. It returns a new DataFrame and leaves the original unchanged. For example, to sort our DataFrame by age in descending order, we can use the following code (the result is named `sorted_df` to avoid shadowing Python's built-in `sorted`):

```py
sorted_df = df.sort_values(by='age', ascending=False)
print(sorted_df)
```

```
      name  age           city  new_column
3     Dave   48  San Francisco           4
1      Bob   33          Paris           2
0    Alice   26       New York           1
2  Charlie   19         London           3
```
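
Sorting by more than one column works the same way: pass a list to `by`, and optionally a matching list to `ascending`. The sketch below uses a hypothetical variant of the data with repeated cities (an assumption, so that the second sort key has a visible effect).

```python
import pandas as pd

# Hypothetical variant of the tutorial's table, with repeated cities
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'age': [25, 35, 18, 48],
    'city': ['Paris', 'Paris', 'London', 'London'],
})

# City A-Z first, then oldest to youngest within each city
ordered = people.sort_values(by=['city', 'age'], ascending=[True, False])
print(ordered)  # order of names: Dave, Charlie, Bob, Alice
```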


## Reading and Interacting with Excel Files
Pandas can also read and write Excel files. The `read_excel()` function reads an Excel file into a pandas DataFrame. Let's assume that you have an Excel file named `data.xlsx` in your working directory that contains the following data:

| Name    | Age | City          |
|---------|-----|---------------|
| Alice   | 25  | New York      |
| Bob     | 32  | Paris         |
| Charlie | 18  | London        |
| Dave    | 47  | San Francisco |

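To read this data into a DataFrame and write it back to a new Excel file, you can use the following code. In recent pandas versions, working with `.xlsx` files requires the optional `openpyxl` package (`pip install openpyxl`).

```py
import pandas as pd

# Read the first sheet of data.xlsx into a DataFrame
df = pd.read_excel('data.xlsx')
print(df)

# Write the DataFrame to a new Excel file, omitting the index column
df.to_excel('output.xlsx', index=False)
```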


