## Introduction to Pandas


### Brief Introduction to Pandas

Pandas is a Python library widely used for data manipulation and analysis. It provides powerful, expressive, and flexible data structures that make it easy to work with structured (tabular, multidimensional, potentially heterogeneous) and time-series data. Pandas is built on top of the Numerical Python (NumPy) package, which means Pandas utilizes many structures from NumPy or builds upon them.

In [None]:
# Importing pandas library
import pandas as pd
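The relationship with NumPy can be seen directly. The following is a minimal sketch (the variable names `values` and `array` are illustrative, not part of the original lesson) showing that a Series can hand its data over as a NumPy array and that NumPy functions accept a Series:

```python
import numpy as np
import pandas as pd

# A small Series of integers (illustrative data)
values = pd.Series([1, 2, 3, 4])

# to_numpy() exposes the data as a NumPy array
array = values.to_numpy()
print(type(array))

# NumPy functions can be applied to a Series; the result is again a Series
roots = np.sqrt(values)
print(type(roots))
```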

### How to Install Pandas

If you're using a distribution like Anaconda, chances are Pandas is already installed. If not, you can install it using pip or conda.

```shell
# Using pip
pip install pandas

# Using conda
conda install pandas
```

### Basics of Pandas: Series & DataFrames

Pandas has two main data structures:

**`Series`**: It's a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It's essentially a single column of data.

In [None]:
# Creating a series
series = pd.Series([1, 2, 3, 4])
print(series)

```
Output:

0    1
1    2
2    3
3    4
dtype: int64
```
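As a rough illustration of "any data type", the sketch below builds three Series and prints the `dtype` that pandas infers for each. The variable names are made up for this example:

```python
import pandas as pd

int_series = pd.Series([1, 2, 3])           # whole numbers
float_series = pd.Series([1.5, 2.5, 3.5])   # floating point numbers
mixed_series = pd.Series([1, "two", 3.0])   # mixed values fall back to generic Python objects

print(int_series.dtype)
print(float_series.dtype)
print(mixed_series.dtype)
```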

**`DataFrame`**: It's a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used pandas object.

In [None]:
# Creating a DataFrame
df = pd.DataFrame({
   "Name": ["Alice", "Bob", "Charlie"],
   "Age": [25, 32, 22]
})
print(df)

```
Output:

      Name  Age
0    Alice   25
1      Bob   32
2  Charlie   22
```
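To make the "dictionary of Series" idea concrete, here is a small sketch (the names `names`, `ages`, and `people` are illustrative) that builds a DataFrame from two Series and then pulls one column back out as a Series:

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", "Charlie"])
ages = pd.Series([25, 32, 22])

# Each dictionary key becomes a column name, each Series a column
people = pd.DataFrame({"Name": names, "Age": ages})

# Selecting a single column returns a Series
age_column = people["Age"]
print(type(age_column))
print(age_column.tolist())
```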

### Creating DataFrames from scratch

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dictionary.

In [None]:
# Creating a DataFrame from a dictionary
data = {
   "fruits": ["apple", "banana", "cherry"],
   "count": [10, 20, 15]
}
df = pd.DataFrame(data)
print(df)

```
Output:

   fruits  count
0   apple     10
1  banana     20
2  cherry     15
```
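A dictionary of columns is only one option. As a sketch of two other common approaches, the same table can be built from a list of row dictionaries or from a list of rows plus column names (the variable names below are illustrative):

```python
import pandas as pd

# One dictionary per row
records = [
    {"fruits": "apple", "count": 10},
    {"fruits": "banana", "count": 20},
    {"fruits": "cherry", "count": 15},
]
from_records = pd.DataFrame(records)

# One list per row, with column names supplied separately
rows = [["apple", 10], ["banana", 20], ["cherry", 15]]
from_rows = pd.DataFrame(rows, columns=["fruits", "count"])

# Both approaches produce the same DataFrame
print(from_records.equals(from_rows))
```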

## Exercise: Getting Started with Pandas

Now that you've learned the basics of Pandas, let's practice with a set of exercises that share a common context.

For these exercises, let's imagine you are a data scientist working for a fruit vendor. You have data about the types of fruit you sell, their colors, and the quantity of each fruit you have in stock.

### Exercise 1: Create a DataFrame

First, let's create a DataFrame that represents the quantity of each fruit in stock, with one column for the fruit names and one column for their quantities. Here are the fruit names and their quantities:

- Apples: 20
- Bananas: 30
- Cherries: 15
- Dates: 10

The expected print of the DataFrame:
```
     Fruits  Counts
0    Apples      20
1   Bananas      30
2  Cherries      15
3     Dates      10
```

In [None]:
data = {"Fruits": [ "Apples", "Bananas", "Cherries", "Dates"], "Counts": [ 20, 30, 15, 10]}
result = pd.DataFrame(data)
print(result)

     Fruits  Counts
0    Apples      20
1   Bananas      30
2  Cherries      15
3     Dates      10



## Understanding Index in Pandas

In both the Series and DataFrame structures in Pandas, the index plays a crucial role. The index is essentially the "name" of each row or item. When a Series or DataFrame is printed, the index labels appear in the leftmost position, but they are not one of the data columns.

By default, when we create a new Series or DataFrame without specifying an index, Pandas will automatically create a numeric index that starts from zero. Here is an example:

In [None]:
# Without explicit index
series = pd.Series([1, 2, 3, 4])
print(series)

```
Output:

0    1
1    2
2    3
3    4
dtype: int64
```
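The automatically created index can be inspected through the `index` attribute. A minimal sketch:

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4])

# The default index is a range of integers starting at zero
print(series.index)
print(list(series.index))
```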

However, we can define our own index, whose labels can be of any hashable type, such as strings, numbers, or dates:

In [None]:
# With explicit index
series = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(series)

```
Output:

a    1
b    2
c    3
d    4
dtype: int64
```
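Because the labels act as names, they can be used to look values up. The sketch below is a brief preview, assuming the same Series as above. It contrasts label-based access with position-based access:

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

print(series["b"])      # look up by label
print(series.loc["c"])  # .loc is the explicit label-based accessor
print(series.iloc[0])   # .iloc looks up by position instead
```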

We can think of the Pandas index as an immutable array or an ordered set (technically a multi-set, as Index objects may contain repeated values).
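Both properties can be checked with a short experiment. This is a sketch with illustrative names (`index`, `modified`, `duplicated`): it builds an Index with a repeated label, tries to change one label in place, and then looks up the repeated label.

```python
import pandas as pd

# An Index may contain repeated labels
index = pd.Index(["a", "b", "b"])
print(index.is_unique)

# Assigning to a position of an Index raises a TypeError
try:
    index[0] = "z"
    modified = True
except TypeError as error:
    modified = False
    print(f"Cannot modify an Index in place: {error}")

# Looking up a repeated label returns every matching item
duplicated = pd.Series([1, 2, 3], index=index)
print(duplicated["b"].tolist())
```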

For DataFrame, indexes work in a similar way:

In [None]:
df = pd.DataFrame({
   "Name": ["Alice", "Bob", "Charlie"],
   "Age": [25, 32, 22]
}, index=["a", "b", "c"])
print(df)

```
Output:

      Name  Age
a    Alice   25
b      Bob   32
c  Charlie   22
```

We can access, modify, and apply various operations on the index, which we will cover in the later sections of this course.

## Saving DataFrame to CSV

Pandas provides a simple and efficient way to save your DataFrame to a CSV file using the `to_csv()` method. Here is how you can use it:

In [None]:
# Save the DataFrame to a CSV file
df.to_csv('my_data.csv', index=False)

Remember to replace `'my_data.csv'` with your desired filename.

The `index=False` argument is used to prevent pandas from writing row names. If you want to include the index, you can omit this argument.
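To see what the `index` argument changes, the sketch below calls `to_csv()` without a filename, in which case it returns the CSV text as a string instead of writing a file. The small DataFrame and the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 32]})

# With no path, to_csv() returns the CSV text
without_index = df.to_csv(index=False)
with_index = df.to_csv()  # index=True is the default

# Compare the header lines of the two versions
print(without_index.splitlines()[0])
print(with_index.splitlines()[0])
```

In the version that keeps the index, the header line starts with an empty field, because the index has no name.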


### Reading from CSV/Excel/SQL databases

Pandas provides handy functions to read data from various sources. Some commonly used functions are:

- `pd.read_csv(filename)`: Reads a comma-separated values (CSV) file and returns a DataFrame.
- `pd.read_excel(filename)`: Reads an Excel file and returns a DataFrame.
- `pd.read_sql(query, connection_object)`: Runs a SQL query and returns the result as a DataFrame.

In [None]:
# Reading data from a CSV file
df = pd.read_csv('file.csv')

# Reading data from an Excel file
df = pd.read_excel('file.xlsx')

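# Reading data from a SQL database (requires SQLAlchemy)
from sqlalchemy import create_engine

# Connects to a temporary in-memory SQLite database. It starts out empty,
# so the query below only works once a table named my_table exists in it.
engine = create_engine('sqlite:///:memory:')
df = pd.read_sql('SELECT * FROM my_table', engine)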

Please remember to replace the filenames and database queries with the ones you'll be using.
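For a SQL example that runs without any existing files, the sketch below uses Python's built-in `sqlite3` module rather than SQLAlchemy. It first writes a small table with `to_sql()` and then reads it back. The table contents and the names `connection`, `stock`, and `loaded` are assumptions made for this illustration:

```python
import sqlite3

import pandas as pd

# Temporary in-memory SQLite database from the standard library
connection = sqlite3.connect(":memory:")

# Write a small illustrative table so there is something to query
stock = pd.DataFrame({"fruit": ["apple", "banana"], "quantity": [10, 20]})
stock.to_sql("my_table", connection, index=False)

# Read the table back into a DataFrame
loaded = pd.read_sql("SELECT * FROM my_table", connection)
print(loaded)

connection.close()
```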

These basics should give you a good start with pandas. We'll cover more advanced pandas features in future lessons.



### Exercise 2: Create a DataFrame

Now, let's add some more information about each fruit. In addition to the fruit names and their quantities, you also have the following data:

- Colors: Apples are red, bananas are yellow, cherries are red, dates are brown.
- Price per kg: Apples cost $3 per kg, bananas cost $2 per kg, cherries cost $4 per kg, dates cost $5 per kg.

Create a DataFrame that includes all of this data, using the fruit names as the index.
- Once the DataFrame is created and its contents are verified as correct, save it to a CSV file named `store_1_stock.csv`, keeping the index in the file.

The expected print of the DataFrame:
```
          Quantity   Color  Price per kg
Apples          20     Red             3
Bananas         30  Yellow             2
Cherries        15     Red             4
Dates           10   Brown             5
```

### Exercise 3: Read Data from a CSV File

Assume you have a file named `store_1_stock.csv` in your current directory. Read this file into a DataFrame. You will notice an additional column 'Unnamed: 0', which is the index of your DataFrame saved as a separate column in the CSV file.

In [None]:
store_1_stock = pd.read_csv('store_1_stock.csv')
print(store_1_stock)

The expected print of the DataFrame:
```
  Unnamed: 0  Quantity   Color  Price per kg
0     Apples        20     Red             3
1    Bananas        30  Yellow             2
2   Cherries        15     Red             4
3      Dates        10   Brown             5
```
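If you would rather get the fruit names back as the index, one option is the `index_col` parameter of `pd.read_csv`. The sketch below reads the CSV text from an in-memory string (an assumption made so the example runs without the file on disk) and compares the two ways of reading it:

```python
import io

import pandas as pd

# CSV text as it would look when saved with the index (unnamed first column)
csv_text = "\n".join([
    ",Quantity,Color,Price per kg",
    "Apples,20,Red,3",
    "Bananas,30,Yellow,2",
    "Cherries,15,Red,4",
    "Dates,10,Brown,5",
])

# Default read: the saved index shows up as a column named 'Unnamed: 0'
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.columns.tolist())

# index_col=0 uses the first column as the index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_read.index.tolist())
```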