# Basics

- Pandas is a data analysis library
- built on NumPy, but adds labels and high-level data operations
- Data structures in Pandas: Series (1D labeled array), DataFrame (2D labeled table)

Uses:
- easy reading/writing of CSV, Excel, JSON
- data cleaning: handle missing values, duplicates, outliers
- data slicing, filtering, grouping
- vectorized operations: usually faster than Python loops
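
To illustrate the last point, here is a minimal sketch comparing a Python loop with the equivalent vectorized expression. The `prices` data and the 1.1 multiplier are made up for illustration; the sketch only checks that both approaches give the same result and makes no timing claim.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 float values
prices = pd.Series(np.arange(100_000, dtype="float64"))

# Python loop: multiplies one element at a time
looped = pd.Series([p * 1.1 for p in prices])

# Vectorized: a single expression applied to the whole Series
vectorized = prices * 1.1

# Both approaches produce the same values
print(looped.equals(vectorized))
```

Timing the two lines with `%timeit` in a notebook is an easy way to see the difference on your own machine.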

# Series
- like a column in Excel
- 1D
- has index labels + values

In [1]:
import pandas as pd

s = pd.Series([10, 20, 30, 40])
s

0    10
1    20
2    30
3    40
dtype: int64

- internally stores its values in a NumPy array (fast)
- stores index labels in an Index object (hash-based for quick lookup)
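
The two parts can be inspected directly. This is a small sketch using `Series.to_numpy()` and `Index.get_loc()`; the variable names `values` and `labeled` are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# The values live in a NumPy array
values = s.to_numpy()
print(type(values))

# The labels live in an Index object (a RangeIndex by default)
print(s.index)

# With custom labels, the Index maps a label to its position
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(labeled.index.get_loc('b'))
```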

### Custom Index

In [3]:
s = pd.Series([10, 20, 30], index=['a','b','c'])
s

a    10
b    20
c    30
dtype: int64

### Access Data

In [8]:
s['b']

20

### Operations

In [10]:
s+5

a    15
b    25
c    35
dtype: int64

# DataFrames
- like an Excel sheet
- columns can have different data types
- each column = a Series
- labeled rows (index) and columns
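
A quick way to check the "each column = a Series" point is to pull one column out and look at its type. This sketch builds a small DataFrame with made-up sample data; the name `age` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Selecting a single column returns a Series
age = df['Age']
print(type(age))

# The column shares the DataFrame's row index and keeps its own dtype
print(age.index.equals(df.index))
print(df.dtypes)
```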

### Creating DataFrame

In [None]:
# 1. From a dictionary

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['Pune', 'Delhi', 'Mumbai']
}

df = pd.DataFrame(data)
df

      Name  Age    City
0    Alice   25    Pune
1      Bob   30   Delhi
2  Charlie   35  Mumbai


In [None]:
# 2. From a list of dictionaries

data = [
    {'Name': 'Alice', 'Age': 25},    # missing 'City' becomes NaN
    {'Name': 'Bob', 'Age': 30, 'City': 'Delhi'},
]

df = pd.DataFrame(data)
df

    Name  Age   City
0  Alice   25    NaN
1    Bob   30  Delhi


### Access
- Accessing columns: `df['col']` or `df.col`
- Accessing rows:
    - `.loc[row_label]` - label based
    - `.iloc[row_position]` - integer position based

In [None]:
df['Name']
df.Name  # attribute access works only if the column name is a valid Python identifier (no spaces, so not e.g. 'work days') and does not clash with a DataFrame attribute or method

0    Alice
1      Bob
Name: Name, dtype: object

In [18]:
## Accessing Rows

df.loc[0]    # row whose index label is 0
df.iloc[0]   # first row by position

Name    Alice
Age        25
City      NaN
Name: 0, dtype: object
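
With the default index, `.loc[0]` and `.iloc[0]` return the same row, which hides the difference between them. The sketch below uses a hypothetical DataFrame with string labels `'x'` and `'y'` to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of 0, 1, ...
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob'], 'Age': [25, 30]},
    index=['x', 'y'],
)

by_label = df.loc['y']      # look up by index label
by_position = df.iloc[1]    # look up by integer position

# Both select the second row here
print(by_label.equals(by_position))

# df.loc[0] would raise a KeyError, because 0 is not one of the labels
```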

## DataFrame Attributes
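- shape - (rows, cols)
- columns - Index of column names (the Index itself has dtype 'object')
- index - row labels; by default a RangeIndex, e.g. start=0, stop=3
- dtypes - data type of each column
- head() - first rows (5 by default); a method, not an attribute
- tail() - last rows (5 by default); a method
- info() - non-null counts, dtypes, memory usage; a method
- describe() - summary statistics for numeric columns; a method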

In [28]:
print(df.shape)
print(df.columns)
print(df.index)
print(df.dtypes)

(2, 3)
Index(['Name', 'Age', 'City'], dtype='object')
RangeIndex(start=0, stop=2, step=1)
Name    object
Age      int64
City    object
dtype: object


In [29]:
print(df.head())
print(df.tail())
print(df.info())  # info() prints its report itself and returns None, so print() also shows "None"
print(df.describe())

    Name  Age   City
0  Alice   25    NaN
1    Bob   30  Delhi
    Name  Age   City
0  Alice   25    NaN
1    Bob   30  Delhi
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    2 non-null      object
 1   Age     2 non-null      int64 
 2   City    1 non-null      object
dtypes: int64(1), object(2)
memory usage: 180.0+ bytes
None
             Age
count   2.000000
mean   27.500000
std     3.535534
min    25.000000
25%    26.250000
50%    27.500000
75%    28.750000
max    30.000000


# Import and Export Data

In [31]:
df = pd.read_csv("C:/Dataset-DS/students.csv")

# index_col: pass index_col=<column name or position> to read_csv to use that column as the row index
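
As a sketch of what `index_col` does, the example below reads a small in-memory CSV instead of a file. The CSV text and its column names (`id`, `name`, `score`) are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so no file is needed
csv_text = "id,name,score\n101,Alice,88\n102,Bob,92\n"

# Default: pandas creates a 0..n-1 RangeIndex and keeps 'id' as a column
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.index)

# index_col='id': the 'id' column becomes the row index
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="id")
print(indexed_df.index)

# Rows can now be looked up by id with .loc
print(indexed_df.loc[102, "name"])
```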

In [32]:
df.to_csv('output.csv', index=False)

# index=False keeps the row index from being written as an extra first column in the CSV
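
To see the effect of `index=False`, this sketch calls `to_csv()` without a path, which returns the CSV text as a string instead of writing a file. The sample data is made up.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Without a path, to_csv returns the CSV text
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)      # first column holds the row index, with an empty header
print(without_index)   # only the Name and Age columns

# Reading the first version back gives an extra 'Unnamed: 0' column
round_trip = pd.read_csv(io.StringIO(with_index))
print(round_trip.columns.tolist())
```

An alternative to `index=False` when writing is to pass `index_col=0` when reading the file back.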

In [None]:
# Reading and writing .xlsx files needs the openpyxl package (pip install openpyxl)

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.to_excel('output.xlsx', index=False)