# Intro:
pandas is an in-memory analysis tool with SQL-like constructs, essential statistical and analytic support, and graphing capability.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
sns.color_palette("hls", 15)

import pandas as pd


print(f"version:: {pd.__version__}")
pd.show_versions()  # prints its report directly and returns None, so no print() is needed

version:: 2.1.4





INSTALLED VERSIONS
------------------
commit              : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python              : 3.10.13.final.0
python-bits         : 64
OS                  : Windows
OS-release          : 10
Version             : 10.0.22621
machine             : AMD64
processor           : AMD64 Family 25 Model 68 Stepping 1, AuthenticAMD
byteorder           : little
LC_ALL              : None
LANG                : None
LOCALE              : English_India.1252

pandas              : 2.1.4
numpy               : 1.26.2
pytz                : 2023.3.post1
dateutil            : 2.8.2
setuptools          : 68.2.2
pip                 : 23.3.1
Cython              : None
pytest              : None
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : None
html5lib            : None
pymysql             : None
psycopg2            : None
jinja2              : 3.1.2
IPython            

# Data Structure
At the core of pandas are two data structures: `Series` and `DataFrame`.

| Data Structure | Dimensionality | Spreadsheet Analog | Database Analog | Linear Algebra |
|----------------|----------------|--------------------|-----------------|----------------|
| Series         |   1D           | Column             | Column          | Column vector  |
| DataFrame      |   2D           | Single Sheet       | Table           | Matrix         |
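
As a rough illustration of the table above, the sketch below builds one object of each kind and checks its dimensionality. The `year` column and its values are made up for the example.

```python
import pandas as pd

# A Series is one-dimensional: a single column of values plus an index.
counts = pd.Series([145, 142, 38, 13], name="counts")

# A DataFrame is two-dimensional: several columns sharing one index.
# The "year" column is invented purely for illustration.
table = pd.DataFrame({
    "counts": [145, 142, 38, 13],
    "year": [1962, 1963, 1964, 1965],
})

print(counts.ndim, counts.shape)   # 1 (4,)
print(table.ndim, table.shape)     # 2 (4, 2)

# Pulling one column out of a DataFrame gives back a Series.
column = table["counts"]
print(type(column).__name__)       # Series
```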

# Series

## Index Abstraction

In [2]:
# A Series modeled as a plain Python dict
series:dict = {
    'index':[0,1,2,3],
    'data':[145,142,38,13],
    'name':'songs'
}
print(series)

def get(series:dict,idx:int)->int:
    value_idx = series['index'].index(idx)
    return series['data'][value_idx]

print(get(series=series,idx=1))
print(get(series=series,idx=3))



# Another example, bit different
songs:dict = {
    'index':['Paul',"John",'Muthu',"Laksh"],
    'data':[145,142,38,13],
    'name':'counts'
}
print(songs)

print(get(songs,idx='Muthu'))

# This double abstraction of the index may seem unnecessary, but it is the trick up pandas's sleeve:
# it allows non-integer index values such as dates and strings.

{'index': [0, 1, 2, 3], 'data': [145, 142, 38, 13], 'name': 'songs'}
142
13
{'index': ['Paul', 'John', 'Muthu', 'Laksh'], 'data': [145, 142, 38, 13], 'name': 'counts'}
38
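
For comparison, here is a minimal sketch of the same lookups on a real `pd.Series`, assuming the dictionary model above. The variable names `songs_ser` and `by_day` and the dates are illustrative only.

```python
import pandas as pd

# Same data as the dict model, now with string labels as the index.
songs_ser = pd.Series(
    [145, 142, 38, 13],
    index=["Paul", "John", "Muthu", "Laksh"],
    name="counts",
)

by_label = songs_ser.loc["Muthu"]   # lookup by index label
by_position = songs_ser.iloc[2]     # lookup by integer position
print(by_label, by_position)        # 38 38

# The index can also hold dates (these dates are made up).
by_day = pd.Series(
    [145, 142, 38, 13],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
third_day = by_day.loc[pd.Timestamp("2024-01-03")]
print(third_day)                    # 38
```

`.loc` plays the role of the `get` helper: it finds the label in the index and returns the value stored at that position.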


## pandas Series
![Series](./assets/pandas%20series%20img.png)

Even though this looks two-dimensional, remember that the index is not part of the values.

To get the best speed (and to leverage vectorized operations), the values should all be of the same type, though this is not required.
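
A small sketch of what "same type" means in practice: a homogeneous list gets a numeric dtype, while a mixed list falls back to the generic `object` dtype. The mixed values below are contrived for the example.

```python
import pandas as pd

# All integers: pandas picks a numeric dtype and can vectorize operations.
same_type = pd.Series([145, 142, 38, 13])

# Mixed int, str, float and None: pandas falls back to Python objects.
mixed = pd.Series([145, "142", 38.0, None])

print(same_type.dtype)   # int64
print(mixed.dtype)       # object

# Vectorized arithmetic on the homogeneous series.
doubled = same_type * 2
print(doubled.tolist())  # [290, 284, 76, 26]
```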

In [3]:
songs2 = pd.Series([145,142,38,13],name='counts')
print(songs2)
print(f"index:: {songs2.index}")

0    145
1    142
2     38
3     13
Name: counts, dtype: int64
index:: RangeIndex(start=0, stop=4, step=1)


## The NaN values

- `NaN` stands for "Not a Number" and plays a role similar to NULL in SQL.
- `None`, `NaN`, `nan`, `<NA>`, and `null` are synonyms in this book when referring to empty or missing data found in a pandas Series or DataFrame.

In [4]:
nan_series = pd.Series([2,np.nan],index=["Ono",'Clapton'])
print(nan_series)
# Note: the dtype is float64, not int64. pandas coerced the values to float so it could hold NaN.

print(f"count function :: {nan_series.count()}")  # Ignoring NaN values
print(f"size property  :: {nan_series.size}")     # Includes missing value

Ono        2.0
Clapton    NaN
dtype: float64
count function :: 1
size property  :: 2


## Optional Integer Support for NaN

In [5]:
nan_series2 = pd.Series([2,None],index=['Ono','Claption'],dtype='Int64')
print(nan_series2)


print(f"count function :: {nan_series2.count()}")  # Ignoring NaN values
print(f"size property  :: {nan_series2.size}")     # Includes missing value

Ono            2
Claption    <NA>
dtype: Int64
count function :: 1
size property  :: 2
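
To put the two representations side by side, the sketch below builds the same data with the default dtype and with the nullable `Int64` dtype, then checks how missing values behave in each. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Default behavior: NaN forces the values to float64.
as_float = pd.Series([2, np.nan], index=["Ono", "Clapton"])

# Nullable integer dtype: the value stays an integer, missing shows as <NA>.
as_int = pd.Series([2, None], index=["Ono", "Clapton"], dtype="Int64")

print(as_float.dtype, as_int.dtype)   # float64 Int64

# .isna() flags the missing entry in both cases.
print(as_float.isna().tolist())       # [False, True]
print(as_int.isna().tolist())         # [False, True]

# Aggregations skip missing values; arithmetic propagates them.
total = as_int.sum()
plus_one = as_int + 1
print(total)                          # 2
print(plus_one.iloc[0])               # 3
```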


## Similar to NumPy
- Index operations
![Filtering with a boolean array](./assets/filteting%20in%20pandas.png)

In [6]:
numpy_ser = np.array([145 , 142 , 38 , 13])
print(f"numpy series index 1 :: {numpy_ser[1]}")
print(f"numpy mean :: {numpy_ser.mean()}")


songs3 = pd.Series(index=['Paul',"John","George","Ringo"],data=[145,142,38,13])
print(f"median of songs :: {songs3.median()}")
mask = songs3 > songs3.median()
print(f"mask::\n{mask}")
print(f"modified songs::\n{songs3[mask]}")

numpy series index 1 :: 142
numpy mean :: 84.5
median of songs :: 90.0
mask::
Paul       True
John       True
George    False
Ringo     False
dtype: bool
modified songs::
Paul    145
John    142
dtype: int64
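
Boolean masks can also be combined. The following sketch, which reuses the song counts from the cell above, assumes a second, made-up condition (an even count) to show `&`, `|`, and `~`.

```python
import pandas as pd

songs = pd.Series([145, 142, 38, 13], index=["Paul", "John", "George", "Ringo"])

above_median = songs > songs.median()   # median is 90.0
even_count = songs % 2 == 0             # illustrative second condition

# Use & (and), | (or), ~ (not) instead of Python's and/or/not.
both = songs[above_median & even_count]
either = songs[above_median | even_count]
neither = songs[~(above_median | even_count)]

print(both.index.tolist())     # ['John']
print(either.index.tolist())   # ['Paul', 'John', 'George']
print(neither.index.tolist())  # ['Ringo']
```

Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `>` or `==`.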


## Categorical  Data

- If our data is limited to a few distinct values, we might want to use categorical data.
- Benefits:
    -   Uses less memory than strings
    -   Improves performance
    -   Can have an ordering
    -   Can perform operations on categories
    -   Enforces membership on values
- Categories are not limited to strings. We can also convert numbers or datetime values to categories.
- `.astype("category")` or `dtype='category'`
- `pd.api.types.CategoricalDtype`
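
Two of the benefits listed above, lower memory use and enforced membership, can be checked with a short sketch. The repeated size labels and the `"xxl"` value are invented for the example, and the exact exception type for an unknown category may differ between pandas versions, so both likely types are caught.

```python
import pandas as pd

# 30,000 values drawn from only three distinct labels.
sizes = pd.Series(["s", "m", "l"] * 10_000)
as_cat = sizes.astype("category")

# deep=True counts the memory of the strings themselves.
str_bytes = sizes.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(cat_bytes < str_bytes)   # True

# Membership is enforced: a value outside the categories is rejected.
rejected = False
try:
    as_cat.iloc[0] = "xxl"
except (TypeError, ValueError) as exc:
    rejected = True
    print(f"rejected: {type(exc).__name__}")

# Numbers can be categories too.
numeric_cat = pd.Series([1, 2, 1, 2]).astype("category")
print(numeric_cat.cat.categories.tolist())   # [1, 2]
```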



| Method | Description |
|--------|-------------|
|pd.Series(data=None, index=None,dtype=None, name=None, copy=False) |  Create a series from data (sequence, dictionary, or scalar).|
| s.index | Access index of series. |
| s.astype(dtype, errors='raise') | Cast a series to dtype. To ignore errors (and return original object) use errors='ignore'.|
| s[boolean_array] | Return values from s where boolean_array is True. |
| s.cat.ordered | Determine if a categorical series is ordered.|
| s.cat.reorder_categories( new_categories, ordered=None) | Reorder the categories (optionally marking them as ordered). new_categories must contain exactly the existing categories.|

In [7]:
s = pd.Series(['m','l','xs','s','xl'],dtype='category')
print(f"s::\n{s}")

# Sizes have a natural order: extra-small, small, medium, large, extra-large
print(f"is it ordered:: {s.cat.ordered}")

# To convert into ordered
s2 = pd.Series(data=['m','l','xs','s','xl'])
size_type = pd.api.types.CategoricalDtype(categories=['s','m','l'],ordered=True)
print(f"ordered size_type:: {size_type}")

s3 = s2.astype(size_type)
print(s3)                       # 'xs' and 'xl' become NaN because they are not in size_type's categories
print(f"is it s3 ordered:: {s3.cat.ordered}")


print(f"filtering s3::\n{s3[s3>'s']}")


# Other way,
s = s.cat.reorder_categories(['xs','s','m','l','xl'],ordered=True)
print(f"s::\n{s}")
print(f"is it ordered:: {s.cat.ordered}")

s::
0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']
is it ordered:: False
ordered size_type:: category
0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
is it s3 ordered:: True
filtering s3::
0    m
1    l
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
s::
0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
is it ordered:: True
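
Once a categorical is ordered, comparisons, sorting, and min/max follow the category order rather than alphabetical order. A minimal sketch, assuming the same five sizes as above:

```python
import pandas as pd

size_type = pd.api.types.CategoricalDtype(
    categories=["xs", "s", "m", "l", "xl"], ordered=True
)
sizes = pd.Series(["m", "l", "xs", "s", "xl"], dtype=size_type)

# min/max and sorting respect the declared order, not the alphabet.
smallest, largest = sizes.min(), sizes.max()
in_order = sizes.sort_values().tolist()
print(smallest, largest)   # xs xl
print(in_order)            # ['xs', 's', 'm', 'l', 'xl']

# Each value is stored as an integer code pointing into the categories.
codes = sizes.cat.codes.tolist()
print(codes)               # [2, 3, 0, 1, 4]

# Comparisons against a category label work as a filter.
medium_or_larger = sizes[sizes >= "m"].tolist()
print(medium_or_larger)    # ['m', 'l', 'xl']
```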


In [8]:
colors = pd.Series(['red','blue','green','violet','orange'],dtype='category')
print(colors)

0       red
1      blue
2     green
3    violet
4    orange
dtype: category
Categories (5, object): ['blue', 'green', 'orange', 'red', 'violet']


# Series Deep Dive

There are many operations you can perform on a Series.

## Loading the Data

If a zip file contains only one CSV, we can read it directly with `pd.read_csv`. If it contains multiple files, we need to decompress the archive and pull out the file we are interested in.

In [9]:
df:pd.DataFrame    = pd.read_csv("./assets/vehicles.csv.zip")
city_mpg:pd.Series = df.city08
highway_mpg:pd.Series = df.highway08
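
For the multi-file case, one possible approach is to open the wanted member with the standard-library `zipfile` module and hand it to `pd.read_csv`. The sketch below builds a throwaway archive first so that it is self-contained; the file names `cars.csv` and `notes.csv` and their contents are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "two_files.zip"

    # Build a small archive with two members (hypothetical contents).
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("cars.csv", "city08,highway08\n19,25\n9,14\n")
        zf.writestr("notes.csv", "note\nignored\n")

    # With more than one member, pick the file explicitly and
    # pass the opened file object to read_csv.
    with zipfile.ZipFile(archive) as zf:
        print(zf.namelist())            # ['cars.csv', 'notes.csv']
        with zf.open("cars.csv") as member:
            cars = pd.read_csv(member)

print(cars)
```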



## Series Attributes

There are many ways to categorize the Series attributes. Here they are grouped roughly by what each one returns:
-   Dunder methods (`.__add__`, `.__iter__`, etc.) provide numeric operations, looping, attribute access, and index access. For numeric operations, these return a Series.
-   Corresponding operator methods for many of the numeric operations let us tweak the behaviour (the `.add` method in addition to `.__add__`).
-   Aggregate methods and properties reduce the values in a series to a single scalar value (`.mean`, `.max`, `.sum`, the `.is_monotonic_increasing` property).
-   Conversion methods. Some of these start with `.to_` and export the data to other formats (Parquet, SQL, Excel, CSV, JSON, ...).
-   Manipulation methods such as `.sort_values` and `.drop_duplicates`, which return a Series with the same index.
-   Indexing and accessor methods and attributes such as `.loc` and `.iloc`, which return a Series or a scalar.
-   String manipulation methods using `.str`
-   Date manipulation methods using `.dt`
-   Plotting methods using `.plot`
-   Categorical manipulation methods using `.cat`
-   Transformation methods such as `.unstack`, `.reset_index`, `.agg`, `.transform`
-   Attributes such as `.index` and `.dtype`
-   A bunch of private attributes that we will ignore.
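
A quick tour of several of these groups on a tiny, made-up series may help make the grouping concrete. The values and labels below are illustrative and are not taken from the vehicles data.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23], index=["a", "b", "c"], name="city_mpg")
other = pd.Series([1, 2], index=["a", "b"])

# Dunder method: + calls .__add__; label "c" has no partner, so it becomes NaN.
dunder_sum = mpg + other

# Operator method: .add does the same but lets us supply a fill value.
method_sum = mpg.add(other, fill_value=0)

# Aggregate method and property: reduce to a single scalar.
mean = mpg.mean()
increasing = mpg.is_monotonic_increasing

# Conversion method: export to another format.
as_dict = mpg.to_dict()

# Manipulation method: returns a new Series.
sorted_mpg = mpg.sort_values()

# Indexing: label-based and position-based access.
by_label, by_position = mpg.loc["b"], mpg.iloc[0]

# Accessors: .str for strings, .dt for dates.
upper = pd.Series(["ford", "bmw"]).str.upper()
years = pd.Series(pd.to_datetime(["2024-01-15"])).dt.year

print(method_sum.tolist())          # [20.0, 11.0, 23.0]
print(mean, increasing)             # 17.0 False
print(sorted_mpg.index.tolist())    # ['b', 'a', 'c']
```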

In [10]:
print(f"how many functions associated with Series:: {len([ _ for _ in dir(pd.Series) if not _.startswith('_') ])}")

how many functions associated with Series:: 207


# Operators

## Intro

## Dunder Methods

## Index Alignment

## Broadcasting

## Iteration

## Operator Methods

## Chaining

# Aggregate Methods

## Aggregation

## Count and Mean of an Attribute

## .agg and Aggregation Strings

# Conversion Methods