## Python Complete Notes

Pandas is a popular open-source data manipulation and analysis library for Python. It is widely used in data science, data analysis, and machine learning because it is easy to use and works efficiently with tabular datasets that fit in memory. Its key features and functionality are outlined below:

### 1. Data Structures
Pandas primarily provides two data structures for manipulating data:

- **Series**: A one-dimensional labeled array capable of holding any data type. Think of it as a column in a table.
- **DataFrame**: A two-dimensional labeled data structure with columns that can be of different types. It is similar to a table in a database or an Excel spreadsheet.
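
The relationship between the two structures can be sketched with a small example. The names `ages` and `people` are made up for illustration; the point is that selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels.
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="Age")

# A DataFrame: columns of possibly different types sharing one index.
people = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "Los Angeles", "Chicago"]},
    index=["Alice", "Bob", "Charlie"],
)

# Selecting a single column of a DataFrame returns a Series.
age_column = people["Age"]
print(type(age_column).__name__)  # Series
print(ages.equals(age_column))    # True
```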

### 2. Data Handling
Pandas offers robust functions to handle and manipulate data, including:

- **Data Loading**: Import data from various file formats such as CSV, Excel, SQL databases, JSON, HTML, and more.
- **Data Cleaning**: Handle missing data, remove duplicates, and perform various transformations.
- **Data Merging**: Combine data from multiple sources using joins, concatenation, and merging techniques.
- **Data Aggregation**: Group data and perform aggregate operations like sum, mean, count, etc.
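
A minimal sketch of these four steps on a tiny dataset follows. The CSV text, the column names, and the `prices` table are hypothetical; `io.StringIO` stands in for a file on disk so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """day,product,units
Mon,Espresso,25
Mon,Latte,
Mon,Latte,
Tue,Espresso,30
Tue,Latte,20
"""

# Loading: read_csv accepts a path or any file-like object.
sales = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop exact duplicate rows, then fill the missing count with 0.
sales = sales.drop_duplicates().fillna({"units": 0})

# Merging: attach a price to each row with a left join on "product".
prices = pd.DataFrame({"product": ["Espresso", "Latte"], "price": [3.0, 4.0]})
merged = sales.merge(prices, on="product", how="left")

# Aggregation: total units per product.
totals = merged.groupby("product")["units"].sum()
print(totals)
```

Filling missing counts with 0 is only one possible choice; dropping those rows with `dropna` may suit other datasets better.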

### 3. Data Analysis
Pandas makes it easy to perform complex data analysis tasks:

- **Descriptive Statistics**: Quickly generate summary statistics for your data.
- **Time Series Analysis**: Efficiently handle and manipulate time series data.
- **Data Visualization**: Easily plot data using integrated plotting functions that work well with Matplotlib.
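
The first two items can be illustrated with a short sketch on made-up daily readings. Plotting is left out so the example does not depend on Matplotlib being installed.

```python
import pandas as pd

# Hypothetical daily readings over six consecutive days.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Descriptive statistics: count, mean, std, min, quartiles, max.
summary = readings.describe()

# Time series: resample into 3-day bins and take the mean of each bin.
three_day_mean = readings.resample("3D").mean()

print(summary["mean"])          # 3.5
print(three_day_mean.tolist())  # [2.0, 5.0]
```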

### 4. Indexing and Selection
Pandas provides powerful tools for data selection and subsetting:

- **Label-based Indexing**: Access data using labels (row and column names).
- **Position-based Indexing**: Access data using integer location-based indexing.
- **Boolean Indexing**: Filter data based on conditions.
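
The three styles can be compared side by side. The sketch below uses a small frame with string row labels so that labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]},
    index=["X", "Y", "Z"],
)

# Label-based: row label "Y", column label "B".
by_label = df.loc["Y", "B"]

# Position-based: second row, second column (positions start at 0).
by_position = df.iloc[1, 1]

# Boolean: keep only the rows where column "A" is greater than 1.
filtered = df[df["A"] > 1]

print(by_label, by_position)    # 5 5
print(filtered.index.tolist())  # ['Y', 'Z']
```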

### 5. Performance
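Pandas is designed for performance on datasets that fit in memory:

- **Optimized Operations**: Many operations are implemented in C or Cython and run on whole columns at once, which is much faster than looping over rows in Python.
- **Memory Efficiency**: Column data is stored in typed arrays, and compact dtypes such as `category` can reduce memory usage for repetitive values.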
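
As a rough illustration of the memory point, the sketch below compares a column of repeated strings with the same column converted to the `category` dtype. The data is made up and the exact byte counts depend on the pandas version, so no specific numbers are shown.

```python
import pandas as pd

# Hypothetical column with only three distinct values repeated many times.
cities = pd.Series(["New York", "Los Angeles", "Chicago"] * 10_000)

# category stores each distinct value once plus a small integer code per row.
as_category = cities.astype("category")

# deep=True counts the memory held by the string values themselves.
string_bytes = cities.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(category_bytes < string_bytes)  # True
```

The saving comes from repetition; a column where nearly every value is unique gains little from `category`.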

### Example Usage
Here is a simple example to illustrate the usage of Pandas:

```python
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Calculate the average age
average_age = df['Age'].mean()
print(f'Average Age: {average_age}')

# Filter data
filtered_df = df[df['Age'] > 30]
print(filtered_df)
```

In this example, we create a DataFrame, display it, calculate the average age, and filter rows based on a condition.

Pandas is a versatile and essential tool in the data scientist's toolkit, making data manipulation and analysis both easy and efficient.

## Intro to DataFrames

In [37]:
import pandas as pd
import numpy as np

In [4]:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df.head()

,0,1,2
0,1,2,3
1,4,5,6
2,7,8,9


In [5]:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=["A","B","C"])
df.head()

,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9


In [6]:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=["A","B","C"], index=["X","Y","Z"])
df.head()

,A,B,C
X,1,2,3
Y,4,5,6
Z,7,8,9


In [9]:
df.tail(2)

,A,B,C
Y,4,5,6
Z,7,8,9


In [10]:
df.columns

Index(['A', 'B', 'C'], dtype='object')

In [7]:
df.index

Index(['X', 'Y', 'Z'], dtype='object')

In [11]:
df.index.to_list()

['X', 'Y', 'Z']

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, X to Z
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
 2   C       3 non-null      int64
dtypes: int64(3)
memory usage: 96.0+ bytes


In [13]:
df.describe()

,A,B,C
count,3.0,3.0,3.0
mean,4.0,5.0,6.0
std,3.0,3.0,3.0
min,1.0,2.0,3.0
25%,2.5,3.5,4.5
50%,4.0,5.0,6.0
75%,5.5,6.5,7.5
max,7.0,8.0,9.0


In [14]:
df.nunique()

A    3
B    3
C    3
dtype: int64

In [15]:
df['A'].unique()

array([1, 4, 7])

In [17]:
df.shape

(3, 3)

In [18]:
df.size

9
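
The attributes `shape` and `size` are easy to confuse on a square frame like the one above. A short sketch on a non-square frame (made up for illustration) shows how they relate:

```python
import pandas as pd

# Two rows and three columns, so rows and columns are distinguishable.
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["A", "B", "C"],
)

rows, cols = df.shape
print(df.shape)                # (2, 3)
print(len(df))                 # 2, the number of rows
print(df.size)                 # 6, the number of cells
print(df.size == rows * cols)  # True
```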

## Loading DataFrames from Files

In [19]:
## Load the CSV file

coffee = pd.read_csv('warmup-data/coffee.csv')

In [20]:
coffee.head()

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35


In [21]:
result = pd.read_parquet('data/results.parquet')

In [22]:
result.head()

,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,


In [23]:
result1 = pd.read_feather('data/results.feather')

In [24]:
result1.head()

,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,


In [27]:
result2 = pd.read_excel('data/olympics-data.xlsx')

In [28]:
result2.head()

,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
0,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02
1,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,
2,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17
3,4,Jacques Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,France,168.0,64.0,1978-03-20
4,5,Albert Canet,1878-04-17,Wandsworth,England,GBR,France,,,1930-07-25


In [30]:
display(coffee)

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30
8,Friday,Espresso,45
9,Friday,Latte,35
10,Saturday,Espresso,45
11,Saturday,Latte,35
12,Sunday,Espresso,45
13,Sunday,Latte,35


In [34]:
coffee.head(6)

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35
5,Wednesday,Latte,25


In [35]:
coffee.tail(7)

,Day,Coffee Type,Units Sold
7,Thursday,Latte,30
8,Friday,Espresso,45
9,Friday,Latte,35
10,Saturday,Espresso,45
11,Saturday,Latte,35
12,Sunday,Espresso,45
13,Sunday,Latte,35


In [36]:
coffee.sample(10) # Pass in random_state to make deterministic

,Day,Coffee Type,Units Sold
13,Sunday,Latte,35
10,Saturday,Espresso,45
12,Sunday,Espresso,45
5,Wednesday,Latte,25
7,Thursday,Latte,30
2,Tuesday,Espresso,30
4,Wednesday,Espresso,35
8,Friday,Espresso,45
6,Thursday,Espresso,40
0,Monday,Espresso,25


In [38]:
coffee.loc[0]

Day              Monday
Coffee Type    Espresso
Units Sold           25
Name: 0, dtype: object

In [39]:
coffee.loc[[0,1,2]]

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30


In [40]:
coffee.loc[0:3]

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20


In [41]:
coffee.loc[0:]

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30
8,Friday,Espresso,45
9,Friday,Latte,35
10,Saturday,Espresso,45
11,Saturday,Latte,35
12,Sunday,Espresso,45
13,Sunday,Latte,35


In [42]:
coffee.loc[:5]

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35
5,Wednesday,Latte,25


In [45]:
coffee.loc[5:12,["Day", "Coffee Type"]]

,Day,Coffee Type
5,Wednesday,Latte
6,Thursday,Espresso
7,Thursday,Latte
8,Friday,Espresso
9,Friday,Latte
10,Saturday,Espresso
11,Saturday,Latte
12,Sunday,Espresso


In [46]:
coffee.iloc[0:8,[0,2]]

,Day,Units Sold
0,Monday,25
1,Monday,15
2,Tuesday,30
3,Tuesday,20
4,Wednesday,35
5,Wednesday,25
6,Thursday,40
7,Thursday,30
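
The outputs above show that a `loc` slice includes its end label while an `iloc` slice excludes its end position. The sketch below repeats that on a four-row frame (made up for illustration) and shows that labels and positions stop coinciding once the rows are reordered.

```python
import pandas as pd

coffee = pd.DataFrame({
    "Day": ["Monday", "Monday", "Tuesday", "Tuesday"],
    "Units Sold": [25, 15, 30, 20],
})

# loc slices by label and includes the end label: rows 0, 1 and 2.
label_slice = coffee.loc[0:2]

# iloc slices by position and excludes the end position: rows 0 and 1.
position_slice = coffee.iloc[0:2]
print(len(label_slice), len(position_slice))  # 3 2

# After sorting, the index labels travel with their rows.
by_units = coffee.sort_values("Units Sold")
print(by_units.index.tolist())        # [1, 3, 0, 2]
print(by_units.loc[0, "Units Sold"])  # 25, the row labeled 0
print(by_units.iloc[0, 1])            # 15, the first row in sorted order
```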


In [57]:
coffee.loc[1,"Units Sold"] = 20

In [58]:
coffee.head()

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,20
2,Tuesday,Espresso,20
3,Tuesday,Latte,20
4,Wednesday,Espresso,20


In [59]:
coffee.loc[1:3,"Units Sold"] = 10

In [60]:
coffee.head()

,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,10
2,Tuesday,Espresso,10
3,Tuesday,Latte,10
4,Wednesday,Espresso,20


In [61]:
coffee.at[0,"Units Sold"]

np.int64(25)

In [63]:
coffee.iat[0,0]

'Monday'
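
The last few cells use `loc` for assignment and `at` / `iat` for single values. A compact sketch of how they fit together, on a hypothetical four-row version of the coffee data:

```python
import pandas as pd

coffee = pd.DataFrame({
    "Day": ["Monday", "Monday", "Tuesday", "Tuesday"],
    "Coffee Type": ["Espresso", "Latte", "Espresso", "Latte"],
    "Units Sold": [25, 15, 30, 20],
})

# loc can assign to a whole slice of labels at once (end label included).
coffee.loc[1:2, "Units Sold"] = 10

# at (by label) and iat (by position) read or write exactly one cell.
coffee.at[0, "Units Sold"] = 26
first_day = coffee.iat[0, 0]

print(coffee["Units Sold"].tolist())  # [26, 10, 10, 20]
print(first_day)                      # Monday
```

For a single cell, `at` and `iat` are the more direct choice; `loc` and `iloc` are needed as soon as a row, a column, or a slice is involved.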