### INTRODUCTION TO PANDAS

#### Introduction to NumPy

NumPy is the foundational Python library for numerical computing.  
It provides efficient array operations, mathematical functions, and tools for linear algebra and random number generation.
Pandas builds on NumPy to provide labeled, table-like data structures for analysis.


In [3]:
import numpy as np

In [8]:
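# arrays in numpy
# a NumPy array holds a single dtype, so mixing numbers with a string
# coerces every element to a Unicode string (dtype <U32 in the output)

some_array = np.array([1, 2, 3, 4, 5, 3.5, 'fola'])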
print(type(some_array))
print(some_array.dtype)
print(some_array)

<class 'numpy.ndarray'>
<U32
['1' '2' '3' '4' '5' '3.5' 'fola']
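
The output above shows every element stored as a string, because a NumPy array holds one dtype. As a minimal sketch of the contrast, the example below builds a purely numeric array and applies vectorized arithmetic to it. The variable names are illustrative only.

```python
import numpy as np

# Mixing numbers and a string forces every element to a string dtype.
mixed = np.array([1, 2, 3.5, 'fola'])

# A purely numeric list is upcast to one numeric dtype (float64 here).
numeric = np.array([1, 2, 3, 4, 5, 3.5])

# Arithmetic on a numeric array is applied element by element.
doubled = numeric * 2

print(mixed.dtype.kind)   # 'U' stands for Unicode string
print(numeric.dtype)
print(doubled)
```

Keeping numeric columns free of stray strings matters in Pandas too, since a single text value turns a whole column into the `object` dtype.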


#### What is Pandas

Pandas is a Python library for data manipulation and analysis.  
It provides two core data structures:
- **Series**: one-dimensional labeled array  
- **DataFrame**: two-dimensional labeled table

Pandas is typically imported using the alias `pd`.  
Most workflows also import NumPy (`np`) since Pandas is built on top of it.


In [9]:
# imports
import pandas as pd

#### Creating a DataFrame

A **DataFrame** is a collection of Series objects sharing the same index.  
It represents tabular data with rows and columns.
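
To illustrate the idea of Series sharing one index, here is a small sketch that assembles a DataFrame from two Series and pulls one column back out. The labels `'a'` and `'b'` and the sample values are made up for the example.

```python
import pandas as pd

# Two Series that share the same index labels.
names = pd.Series(['fola', 'kola'], index=['a', 'b'])
ages = pd.Series([23, 34], index=['a', 'b'])

# Each Series becomes one column of the DataFrame.
people = pd.DataFrame({'name': names, 'age': ages})

# Selecting a single column returns a Series again.
age_column = people['age']

print(people)
print(type(age_column))
```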

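#### *Two ways of creating DataFrames*:
- From a list of lists (one inner list per row), with column names passed separately
- From a dictionary that maps each column name to a list of values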

In [12]:
# create a pandas dataframe

# method 1

students = [
    ['fola',23,'m'],
    ['kola',34,'f'],
    ['tobi',45,'m'],
    ['titi',54,'f']
]
columns=['name','age','gender']
students_df = pd.DataFrame(data=students, columns=columns)
students_df


|  | name | age | gender |
|---|---|---|---|
| 0 | fola | 23 | m |
| 1 | kola | 34 | f |
| 2 | tobi | 45 | m |
| 3 | titi | 54 | f |


In [15]:
# method 2 


students_dict = {
    'name': ['tola','tito','mike','ronaldo'],
    'age': [23,4,56,43],
    'gender': ['m','f','f','m']
}

students_df_two = pd.DataFrame(data=students_dict)
students_df_two


|  | name | age | gender |
|---|---|---|---|
| 0 | tola | 23 | m |
| 1 | tito | 4 | f |
| 2 | mike | 56 | f |
| 3 | ronaldo | 43 | m |


#### Reading and Writing Data

Pandas reads from and writes to multiple file formats:
- `read_csv()`, `to_csv()`  
- `read_excel()`, `to_excel()`
- `read_sql()`, `to_sql()` for SQL databases
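
As a minimal sketch of a read/write pair, the example below round-trips a small, made-up DataFrame through CSV text. It uses an in-memory `io.StringIO` buffer in place of a file so it runs without any data on disk; the `students` frame and its values are illustrative only.

```python
import io

import pandas as pd

students = pd.DataFrame({'name': ['fola', 'kola'], 'age': [23, 34]})

# Write CSV text to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
students.to_csv(buffer, index=False)  # index=False leaves out the row labels

# Rewind the buffer and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing `index=False` when writing avoids an extra unnamed index column appearing when the file is read back.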

In [17]:
insurance_df = pd.read_csv('insurance.csv')
insurance_df

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |


#### Viewing Data

Basic inspection functions include:
- `head()` and `tail()` to preview the first and last rows (five by default)  
- `info()` to display column names, non-null counts, data types, and memory usage  

These help you understand a dataset's structure and completeness.


In [21]:
# head

insurance_df.head(n=3)

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |


In [22]:
# tail 

insurance_df.tail()

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 1333 | 50 | male | 30.97 | 3 | no | northwest | 10600.5483 |
| 1334 | 18 | female | 31.92 | 0 | no | northeast | 2205.9808 |
| 1335 | 18 | female | 36.85 | 0 | no | southeast | 1629.8335 |
| 1336 | 21 | female | 25.8 | 0 | no | southwest | 2007.945 |
| 1337 | 61 | female | 29.07 | 0 | yes | northwest | 29141.3603 |


In [23]:
# info

insurance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


#### Accessing Columns and Rows

Data can be accessed using:
- Column names (`df['col']`)  
- Row indexers (`df.loc[]`, `df.iloc[]`)  

`loc` selects by label, while `iloc` selects by integer position.
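
With the default `RangeIndex`, labels and positions coincide, so the two indexers can look interchangeable. The sketch below uses a made-up `scores` frame whose index labels are 10, 20, and 30 to make the difference visible.

```python
import pandas as pd

# Index labels (10, 20, 30) deliberately differ from positions (0, 1, 2).
scores = pd.DataFrame({'score': [70, 85, 90]}, index=[10, 20, 30])

by_label = scores.loc[20, 'score']   # row whose label is 20
by_position = scores.iloc[0, 0]      # first row, first column

# Label slices include the end label; position slices exclude the end.
label_slice = scores.loc[10:20]
position_slice = scores.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```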


In [36]:
# single column

insurance_df['region']

0       southwest
1       southeast
2       southeast
3       northwest
4       northwest
          ...    
1333    northwest
1334    northeast
1335    southeast
1336    southwest
1337    northwest
Name: region, Length: 1338, dtype: object

In [32]:
# multiple columns

insurance_df[['age','sex']]

|  | age | sex |
|---|---|---|
| 0 | 19 | female |
| 1 | 18 | male |
| 2 | 28 | male |
| 3 | 33 | male |
| 4 | 32 | male |
| ... | ... | ... |
| 1333 | 50 | male |
| 1334 | 18 | female |
| 1335 | 18 | female |
| 1336 | 21 | female |


In [28]:
# single row

insurance_df.iloc[1]

age                18
sex              male
bmi             33.77
children            1
smoker             no
region      southeast
charges     1725.5523
Name: 1, dtype: object

In [29]:
# multiple rows

insurance_df.iloc[15:21]

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 15 | 19 | male | 24.6 | 1 | no | southwest | 1837.237 |
| 16 | 52 | female | 30.78 | 1 | no | northeast | 10797.3362 |
| 17 | 23 | male | 23.845 | 0 | no | northeast | 2395.17155 |
| 18 | 56 | male | 40.3 | 0 | no | southwest | 10602.385 |
| 19 | 30 | male | 35.3 | 0 | yes | southwest | 36837.467 |
| 20 | 60 | female | 36.005 | 0 | no | northeast | 13228.84695 |


#### Indexing and Slicing

Subsets of data can be selected using:
- Slicing syntax (`df[0:5]`), which selects rows by position  
- Conditional filters (`df[df['col'] > 10]`), which keep the rows where the condition is true  

This enables focused data exploration.


In [38]:
# slice rows at positions 2 to 4 (the end position is excluded)

insurance_df[2:5]

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |


In [42]:
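# conditional filters

# single condition (not displayed: a cell shows only its last expression)
insurance_df[insurance_df['sex'] == 'male']

# combine conditions with & and wrap each condition in parentheses
insurance_df[(insurance_df['sex'] == 'male') & (insurance_df['smoker'] == 'no')]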

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| 8 | 37 | male | 29.830 | 2 | no | northeast | 6406.41070 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1324 | 31 | male | 25.935 | 1 | no | northwest | 4239.89265 |
| 1325 | 61 | male | 33.535 | 0 | no | northeast | 13143.33665 |
| 1327 | 51 | male | 30.030 | 1 | no | southeast | 9377.90470 |
| 1329 | 52 | male | 38.600 | 2 | no | southwest | 10325.20600 |


#### Data Types and Conversion

Pandas infers column data types automatically.

Use:
- `df.dtypes` to view types  
- `astype()` to convert between types
- `pd.to_datetime()` to convert to datetime
- `pd.to_numeric()` to convert to a numeric type
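
The conversion functions are easiest to see on text data. The following sketch uses a made-up `raw` frame whose columns start out as strings; `errors='coerce'` is one option among several and turns unparseable values into `NaN` instead of raising an error.

```python
import pandas as pd

# Both columns start out as text, as they might after a messy import.
raw = pd.DataFrame({
    'amount': ['10', '20.5', 'n/a'],
    'joined': ['2024-01-15', '2024-02-01', '2024-03-10'],
})

# Values that cannot be parsed as numbers become NaN.
raw['amount'] = pd.to_numeric(raw['amount'], errors='coerce')

# ISO-formatted date strings are parsed into datetime values.
raw['joined'] = pd.to_datetime(raw['joined'])

print(raw.dtypes)
print(raw)
```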

In [43]:
insurance_df['children'].dtype

dtype('int64')

In [46]:
insurance_df = insurance_df.astype(
    {'age':'float',
     'children':'float'})

insurance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   float64
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   float64
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(4), object(3)
memory usage: 73.3+ KB


In [47]:
insurance_df

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19.0 | female | 27.900 | 0.0 | yes | southwest | 16884.92400 |
| 1 | 18.0 | male | 33.770 | 1.0 | no | southeast | 1725.55230 |
| 2 | 28.0 | male | 33.000 | 3.0 | no | southeast | 4449.46200 |
| 3 | 33.0 | male | 22.705 | 0.0 | no | northwest | 21984.47061 |
| 4 | 32.0 | male | 28.880 | 0.0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50.0 | male | 30.970 | 3.0 | no | northwest | 10600.54830 |
| 1334 | 18.0 | female | 31.920 | 0.0 | no | northeast | 2205.98080 |
| 1335 | 18.0 | female | 36.850 | 0.0 | no | southeast | 1629.83350 |
| 1336 | 21.0 | female | 25.800 | 0.0 | no | southwest | 2007.94500 |


#### Handling Missing Data

Missing values appear as `NaN`.  
Useful functions:
- `isna()` to detect  
- `fillna()` to replace  
- `dropna()` to remove
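
The insurance dataset in this notebook has no missing values, so the three functions have nothing to act on there. The sketch below uses a small made-up `people` frame with gaps to show each one; filling with the median is just one possible strategy.

```python
import numpy as np
import pandas as pd

# A small frame with one missing value in each column.
people = pd.DataFrame({
    'age': [20.0, np.nan, 40.0],
    'city': ['Lagos', 'Abuja', None],
})

# Detect: count the missing values per column.
missing_per_column = people.isna().sum()

# Replace: fill missing ages with the median of the known ages.
filled = people.assign(age=people['age'].fillna(people['age'].median()))

# Remove: keep only the rows that have no missing values at all.
complete_rows = people.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```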


In [52]:
insurance_df.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [None]:
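median = insurance_df['age'].median()
# assign the result back; fillna(inplace=True) on a selected column is
# deprecated in recent pandas versions and may not update the DataFrame
insurance_df['age'] = insurance_df['age'].fillna(value=median)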

In [56]:
insurance_df.dropna(inplace=True)

#### Descriptive Statistics
Use `describe()` for a quick summary of numeric data.  
Other methods include:
- `mean()` and `median()` for central tendency  
- `std()` for spread  
- `corr()` for relationships between numeric columns
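
As a quick check on what these methods return, the sketch below applies them to a tiny made-up frame in which `y` is exactly twice `x`, so the results are easy to verify by hand.

```python
import pandas as pd

# y is exactly 2 * x, so the two columns are perfectly correlated.
data = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})

center = data['x'].mean()             # arithmetic mean
middle = data['x'].median()           # middle value
spread = data['x'].std()              # sample standard deviation (ddof=1)
relation = data['x'].corr(data['y'])  # Pearson correlation by default

print(center, middle, spread, relation)
```

Note that `std()` in Pandas divides by `n - 1` by default, which differs from NumPy's `np.std` default of dividing by `n`.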


In [57]:
insurance_df.describe()

|  | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [61]:
insurance_df.select_dtypes(include='float').corr(method='spearman')

|  | age | bmi | children | charges |
|---|---|---|---|---|
| age | 1.0 | 0.107736 | 0.056992 | 0.534392 |
| bmi | 0.107736 | 1.0 | 0.015607 | 0.119396 |
| children | 0.056992 | 0.015607 | 1.0 | 0.133339 |
| charges | 0.534392 | 0.119396 | 0.133339 | 1.0 |


#### Filtering Data

Use boolean indexing to filter rows: build a True/False mask from a condition and pass it to `df[...]`.  
Rows where the mask is True are kept, similar to a SQL `WHERE` clause.
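
A minimal sketch of boolean filtering on a small made-up `people` frame follows. It shows a single condition, two conditions combined with `&`, membership with `isin()`, and negation with `~`.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['fola', 'kola', 'tobi', 'titi'],
    'age': [23, 34, 45, 54],
    'gender': ['m', 'f', 'm', 'f'],
})

# Single condition: the comparison produces a True/False mask.
over_30 = people[people['age'] > 30]

# Combined conditions: use & (and) or | (or), with parentheses around each.
women_over_40 = people[(people['gender'] == 'f') & (people['age'] > 40)]

# Membership test, comparable to SQL's IN.
selected = people[people['name'].isin(['fola', 'titi'])]

# Negation with ~, comparable to SQL's NOT.
not_male = people[~(people['gender'] == 'm')]

print(over_30)
print(women_over_40)
```

Python's `and`, `or`, and `not` keywords do not work element by element on a Series, which is why the `&`, `|`, and `~` operators are used instead.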


#### Adding and Modifying Columns
New columns can be created or updated using:
- Arithmetic operations between columns  
- Assignments with new data or expressions


In [None]:
# basic arithmetic
insurance_df['tax'] = insurance_df['charges'] * 0.04
insurance_df.head()

|  | age | sex | bmi | children | smoker | region | charges | tax |
|---|---|---|---|---|---|---|---|---|
| 0 | 19.0 | female | 27.9 | 0.0 | yes | southwest | 16884.924 | 675.39696 |
| 1 | 18.0 | male | 33.77 | 1.0 | no | southeast | 1725.5523 | 69.022092 |
| 2 | 28.0 | male | 33.0 | 3.0 | no | southeast | 4449.462 | 177.97848 |
| 3 | 33.0 | male | 22.705 | 0.0 | no | northwest | 21984.47061 | 879.378824 |
| 4 | 32.0 | male | 28.88 | 0.0 | no | northwest | 3866.8552 | 154.674208 |


In [64]:
# create a new column based on condition

insurance_df['age_class'] = ['teenager' if age < 20 else
                             'adult' if age < 59 else 'old'
                             for age in insurance_df['age']]
insurance_df.head()

|  | age | sex | bmi | children | smoker | region | charges | tax | age_class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.0 | female | 27.9 | 0.0 | yes | southwest | 16884.924 | 675.39696 | teenager |
| 1 | 18.0 | male | 33.77 | 1.0 | no | southeast | 1725.5523 | 69.022092 | teenager |
| 2 | 28.0 | male | 33.0 | 3.0 | no | southeast | 4449.462 | 177.97848 | adult |
| 3 | 33.0 | male | 22.705 | 0.0 | no | northwest | 21984.47061 | 879.378824 | adult |
| 4 | 32.0 | male | 28.88 | 0.0 | no | northwest | 3866.8552 | 154.674208 | adult |
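
The list comprehension above works, and `pd.cut()` is an alternative that expresses the same thresholds as bins. This is a sketch on a made-up `ages` Series rather than the notebook's own method; `right=False` makes each bin include its lower edge and exclude its upper edge, matching the `age < 20` and `age < 59` tests.

```python
import numpy as np
import pandas as pd

ages = pd.Series([19.0, 18.0, 28.0, 59.0, 64.0])

# Bins are [0, 20), [20, 59), and [59, inf) because right=False.
age_class = pd.cut(
    ages,
    bins=[0, 20, 59, np.inf],
    labels=['teenager', 'adult', 'old'],
    right=False,
)

print(age_class.tolist())
```

The result is a categorical Series, which keeps the labels in a defined order and can be assigned to a DataFrame column like any other Series.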


#### Grouping and Aggregation

`groupby()` allows summarizing data by one or more keys.  
Common aggregations include:
- `sum()`
- `mean()`
- `count()`


In [68]:
insurance_df[['charges','region']].groupby(by='region',sort=True).count()

| region | charges |
|---|---|
| northeast | 324 |
| northwest | 325 |
| southeast | 364 |
| southwest | 325 |
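
Several aggregations can be computed in one pass with `agg()`. The sketch below uses a small made-up `claims` frame, with invented region names and amounts, so the grouped results can be checked by hand.

```python
import pandas as pd

claims = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south', 'north'],
    'charges': [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Group by region, then compute three aggregations of the charges column.
summary = claims.groupby('region')['charges'].agg(['count', 'sum', 'mean'])

print(summary)
```

Each aggregation name becomes a column of the result, and the group keys become its index.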
