# Getting Started with Pandas

## Importing pandas

To use pandas for data manipulation and analysis, first import the library into your Python environment:

In [1]:
import pandas as pd

By convention, pandas is imported under the alias `pd`, which gives shorter access to its functions and classes.

## Creating Series and DataFrames

Pandas provides two fundamental data structures: Series and DataFrames.
Understanding both is essential for working effectively with data in pandas.

## An introduction to Series

Pandas Series are one-dimensional labeled arrays capable of holding any data type.

In [2]:
# A sample Series for product sales quantities
data = [15, 10, 23, 8, 12]
index = ['Product A', 'Product B', 'Product C', 'Product A', 'Product C']
sales = pd.Series(data, index=index)
sales

Product A    15
Product B    10
Product C    23
Product A     8
Product C    12
dtype: int64

### Accessing elements in a Series

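Access the quantity of `Product A` sold. Because the label `Product A` appears twice in the index, the result is a Series containing both entries rather than a single value: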

In [3]:
sales['Product A']

Product A    15
Product A     8
dtype: int64
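
As a minimal sketch of how label-based and position-based access differ, the example below rebuilds the same `sales` Series and uses `.loc` (labels) and `.iloc` (positions). The variable names `product_a`, `product_b`, and `first_entry` are chosen here for illustration only.

```python
import pandas as pd

# Same data as the sales Series above
sales = pd.Series(
    [15, 10, 23, 8, 12],
    index=['Product A', 'Product B', 'Product C', 'Product A', 'Product C'],
)

# A label that appears once returns a single value
product_b = sales.loc['Product B']

# A label that appears more than once returns a Series with every match
product_a = sales.loc['Product A']

# Position-based access always returns exactly one element
first_entry = sales.iloc[0]

print(product_b)
print(list(product_a))
print(first_entry)
```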

### Add a new product

In [4]:
sales['Product D'] = float('nan')
sales

Product A    15.0
Product B    10.0
Product C    23.0
Product A     8.0
Product C    12.0
Product D     NaN
dtype: float64
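
Note that the dtype changed from `int64` to `float64` in the output above: `NaN` is a floating-point value, so pandas converts the whole Series to floats to hold it. The sketch below, using a shortened version of the Series, checks the dtype before and after the assignment and then locates the missing entry with `isna()`.

```python
import pandas as pd

sales = pd.Series([15, 10, 23], index=['Product A', 'Product B', 'Product C'])
dtype_before = sales.dtype  # integer dtype

# Assigning to a label that does not exist yet appends a new entry
sales['Product D'] = float('nan')
dtype_after = sales.dtype  # float dtype, because NaN is a float

# Boolean Series marking which entries are missing
missing = sales.isna()

print(dtype_before, dtype_after)
print(missing)
```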

### Basic operations

Calculate the total quantity of all products sold:

In [5]:
sales.sum()

68.0

Calculate the average quantity per entry. Like `sum()`, `mean()` skips missing values by default, so the `NaN` for `Product D` is ignored:

In [6]:
sales.mean()

13.6
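
To make the handling of missing values and repeated labels explicit, here is a small sketch. It assumes you want one total per product label, which the original example does not compute; `groupby(level=0)` groups the entries by their index label.

```python
import pandas as pd

sales = pd.Series(
    [15, 10, 23, 8, 12, float('nan')],
    index=['Product A', 'Product B', 'Product C',
           'Product A', 'Product C', 'Product D'],
)

# Default behaviour: NaN is skipped, so this is 68 / 5
mean_skipping_nan = sales.mean()

# With skipna=False a single NaN makes the result NaN
mean_with_nan = sales.mean(skipna=False)

# One total per product label (repeated labels are combined)
per_product_total = sales.groupby(level=0).sum()

print(mean_skipping_nan)
print(mean_with_nan)
print(per_product_total)
```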

You can also perform element-wise operations, for example doubling all the quantities:

In [7]:
sales * 2

Product A    30.0
Product B    20.0
Product C    46.0
Product A    16.0
Product C    24.0
Product D     NaN
dtype: float64
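
Element-wise operations also work between two Series, in which case pandas matches values by index label rather than by position. The sketch below illustrates this with a hypothetical `prices` Series; the prices are invented for this example and are not part of the original data.

```python
import pandas as pd

sales = pd.Series([15, 10, 23], index=['Product A', 'Product B', 'Product C'])

# Hypothetical unit prices, deliberately listed in a different order
prices = pd.Series([2.5, 4.0, 1.0], index=['Product C', 'Product A', 'Product B'])

# Values are aligned by label before multiplying
revenue = sales * prices

# Comparisons are element-wise too and return a boolean Series
above_twelve = sales > 12

print(revenue)
print(above_twelve)
```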

## An introduction to DataFrames

Pandas DataFrames are two-dimensional labeled data structures with columns of potentially different types.

In [8]:
# A sample DataFrame of employee information
data = {
    'employee_id': [2, 3, 4, 7, 8, 9],
    'name': ['Sal', 'Yang', 'Khaya', 'Lin', 'Eve', 'Mike'],
    'department': ['Sales', 'Marketing', 'Engineering', 'Sales', 'Engineering', 'Sales'],
    'salary': [60000, 75000, 80000, 62000, 90000, 70000]
}
employees = pd.DataFrame(data)
employees

|   | employee_id | name | department | salary |
|---|---|---|---|---|
| 0 | 2 | Sal | Sales | 60000 |
| 1 | 3 | Yang | Marketing | 75000 |
| 2 | 4 | Khaya | Engineering | 80000 |
| 3 | 7 | Lin | Sales | 62000 |
| 4 | 8 | Eve | Engineering | 90000 |
| 5 | 9 | Mike | Sales | 70000 |


### Viewing DataFrames

You can use the `head()` and `tail()` methods to view the first and last rows of a DataFrame.
By default, five rows are shown, but you can specify how many you want, for example `employees.head(10)`.
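
As a quick illustration of `head()` and `tail()`, the following sketch rebuilds the `employees` DataFrame so that it runs on its own. The names `first_two`, `last_two`, and `default_head` are chosen here for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    'employee_id': [2, 3, 4, 7, 8, 9],
    'name': ['Sal', 'Yang', 'Khaya', 'Lin', 'Eve', 'Mike'],
    'department': ['Sales', 'Marketing', 'Engineering',
                   'Sales', 'Engineering', 'Sales'],
    'salary': [60000, 75000, 80000, 62000, 90000, 70000],
})

first_two = employees.head(2)    # first two rows
last_two = employees.tail(2)     # last two rows
default_head = employees.head()  # no argument: up to five rows

print(first_two)
print(last_two)
print(len(default_head))
```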

We can also look at a random row with `sample()`. Because the row is chosen at random, the output varies from run to run:

In [9]:
employees.sample()

|   | employee_id | name | department | salary |
|---|---|---|---|---|
| 2 | 4 | Khaya | Engineering | 80000 |


Pass a number to draw several random rows:

In [10]:
employees.sample(3)

|   | employee_id | name | department | salary |
|---|---|---|---|---|
| 0 | 2 | Sal | Sales | 60000 |
| 5 | 9 | Mike | Sales | 70000 |
| 3 | 7 | Lin | Sales | 62000 |
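
If you need the same random rows on every run, for example in a tutorial or a test, one option is the `random_state` parameter of `sample()`. The sketch below is an illustration under that assumption; the seed values are arbitrary.

```python
import pandas as pd

employees = pd.DataFrame({
    'employee_id': [2, 3, 4, 7, 8, 9],
    'name': ['Sal', 'Yang', 'Khaya', 'Lin', 'Eve', 'Mike'],
    'department': ['Sales', 'Marketing', 'Engineering',
                   'Sales', 'Engineering', 'Sales'],
    'salary': [60000, 75000, 80000, 62000, 90000, 70000],
})

# The same seed gives the same rows each time
sample_a = employees.sample(3, random_state=42)
sample_b = employees.sample(3, random_state=42)
print(sample_a.equals(sample_b))

# frac draws a fraction of the rows instead of a fixed count
half = employees.sample(frac=0.5, random_state=0)
print(len(half))
```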


Use `describe()` for basic summary statistics of the numeric columns:

In [11]:
employees.describe()

|   | employee_id | salary |
|---|---|---|
| count | 6.0 | 6.0 |
| mean | 5.5 | 72833.333333 |
| std | 2.880972 | 11321.071798 |
| min | 2.0 | 60000.0 |
| 25% | 3.25 | 64000.0 |
| 50% | 5.5 | 72500.0 |
| 75% | 7.75 | 78750.0 |
| max | 9.0 | 90000.0 |
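
`describe()` can also be called on a single column. For a text column such as `department` it reports the count, the number of unique values, the most frequent value, and its frequency instead of numeric statistics. A minimal sketch, with variable names chosen for illustration:

```python
import pandas as pd

employees = pd.DataFrame({
    'employee_id': [2, 3, 4, 7, 8, 9],
    'name': ['Sal', 'Yang', 'Khaya', 'Lin', 'Eve', 'Mike'],
    'department': ['Sales', 'Marketing', 'Engineering',
                   'Sales', 'Engineering', 'Sales'],
    'salary': [60000, 75000, 80000, 62000, 90000, 70000],
})

# Text column: count, unique, top (most frequent value), freq
dept_summary = employees['department'].describe()

# Numeric column: count, mean, std, min, quartiles, max
salary_summary = employees['salary'].describe()

print(dept_summary)
print(salary_summary)
```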


Use `info()` for a summary of the index, the column dtypes, and the non-null counts:

In [12]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   employee_id  6 non-null      int64 
 1   name         6 non-null      object
 2   department   6 non-null      object
 3   salary       6 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes
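
`info()` prints its report rather than returning it. When you need the same facts as values you can work with, the attributes and methods in this sketch are one way to get them; the variable names are chosen for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    'employee_id': [2, 3, 4, 7, 8, 9],
    'name': ['Sal', 'Yang', 'Khaya', 'Lin', 'Eve', 'Mike'],
    'department': ['Sales', 'Marketing', 'Engineering',
                   'Sales', 'Engineering', 'Sales'],
    'salary': [60000, 75000, 80000, 62000, 90000, 70000],
})

n_rows, n_cols = employees.shape           # (rows, columns)
column_dtypes = employees.dtypes           # dtype of each column
missing_per_column = employees.isna().sum()  # missing values per column

print(n_rows, n_cols)
print(column_dtypes)
print(missing_per_column)
```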


### Selecting columns

In [13]:
employees[['name', 'department']]

|   | name | department |
|---|---|---|
| 0 | Sal | Sales |
| 1 | Yang | Marketing |
| 2 | Khaya | Engineering |
| 3 | Lin | Sales |
| 4 | Eve | Engineering |
| 5 | Mike | Sales |
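
The double brackets above matter: a list of column names returns a DataFrame, while a single column name in single brackets returns a Series. The sketch below shows the difference, plus the equivalent `.loc` form; the variable names are chosen for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    'employee_id': [2, 3, 4, 7, 8, 9],
    'name': ['Sal', 'Yang', 'Khaya', 'Lin', 'Eve', 'Mike'],
    'department': ['Sales', 'Marketing', 'Engineering',
                   'Sales', 'Engineering', 'Sales'],
    'salary': [60000, 75000, 80000, 62000, 90000, 70000],
})

names = employees['name']          # single brackets: a Series
names_frame = employees[['name']]  # list of names: a one-column DataFrame

# .loc selects all rows (:) and the listed columns
subset = employees.loc[:, ['name', 'salary']]

print(type(names).__name__)
print(type(names_frame).__name__)
print(list(subset.columns))
```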


### Selecting rows based on a condition

In [14]:
engineering_employees = employees[employees['department'] == 'Engineering']
engineering_employees

|   | employee_id | name | department | salary |
|---|---|---|---|---|
| 2 | 4 | Khaya | Engineering | 80000 |
| 4 | 8 | Eve | Engineering | 90000 |
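
The expression `employees['department'] == 'Engineering'` produces a boolean Series, and indexing with it keeps only the rows marked `True`. Several such conditions can be combined with `&` (and) and `|` (or). The following sketch shows this; the salary threshold and variable names are chosen for illustration only.

```python
import pandas as pd

employees = pd.DataFrame({
    'employee_id': [2, 3, 4, 7, 8, 9],
    'name': ['Sal', 'Yang', 'Khaya', 'Lin', 'Eve', 'Mike'],
    'department': ['Sales', 'Marketing', 'Engineering',
                   'Sales', 'Engineering', 'Sales'],
    'salary': [60000, 75000, 80000, 62000, 90000, 70000],
})

# Each comparison yields a boolean Series with one entry per row
is_sales = employees['department'] == 'Sales'
earns_over_60k = employees['salary'] > 60000

# Combine conditions with & (and); keep each comparison in its own variable
# or wrap it in parentheses, because & binds tighter than == and >
well_paid_sales = employees[is_sales & earns_over_60k]

# isin() matches any value from a list
sales_or_marketing = employees[employees['department'].isin(['Sales', 'Marketing'])]

# .loc filters rows and selects a column in one step
names_only = employees.loc[is_sales & earns_over_60k, 'name']

print(well_paid_sales)
print(sales_or_marketing)
print(list(names_only))
```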
