# PANDAS

Pandas is a popular open-source Python library for data manipulation and analysis. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008. It provides easy-to-use data structures and functions for working with structured data such as tabular data and time series. Pandas is widely used for data analysis and data preprocessing and is an essential tool in the data science ecosystem.

Key features and components of Pandas include:

1. **Data Structures**:
   - **DataFrame**: A two-dimensional, labeled data structure similar to a spreadsheet or SQL table. It consists of rows and columns and is the primary data structure used in Pandas. Each column can have a different data type.
   - **Series**: A one-dimensional labeled array that can hold data of any type. Series objects are the building blocks of DataFrames.

2. **Data Import and Export**:
   - Pandas can read data from various file formats, including CSV, Excel, SQL databases, JSON, and more.
   - It can write data back to these formats as well, making it easy to exchange data with external systems.

3. **Data Cleaning and Transformation**:
   - Pandas provides numerous functions for data cleaning and transformation, including handling missing data, removing duplicates, reshaping data, and more.
   - It supports data alignment and merging, similar to SQL joins.

4. **Data Indexing and Selection**:
   - You can use labels or integer-based indexing to select and filter data within DataFrames and Series.
   - Boolean indexing allows for complex data selection based on conditions.

5. **Grouping and Aggregation**:
   - Pandas supports grouping data by one or more columns and applying various aggregation functions like sum, mean, count, and custom functions.
   - Grouping is often used for data summarization and analysis.

6. **Time Series Analysis**:
   - Pandas has extensive support for time series data, including date and time indexing, resampling, and rolling calculations.
   - It is widely used for financial and temporal data analysis.

7. **Data Visualization Integration**:
   - Pandas can integrate with popular data visualization libraries like Matplotlib and Seaborn for creating visualizations directly from DataFrames and Series.

8. **Interoperability**:
   - Pandas can be used in conjunction with other libraries in the Python data ecosystem, such as NumPy, SciPy, and scikit-learn, to perform comprehensive data analysis.

Pandas simplifies data analysis and manipulation in Python, making it a go-to library for tasks such as data cleaning, exploration, transformation, and visualization. Its intuitive syntax and powerful capabilities make it an indispensable tool for data professionals and researchers working with structured data.
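Several of the features listed earlier (handling missing data, boolean indexing, grouping and aggregation) can be sketched in a few lines. The sample data below is invented purely for illustration and does not come from any real dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, None, 30],
    "City": ["New York", "Boston", "Boston", "New York"],
})

# Data cleaning: fill the missing age with the mean of the known ages.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Boolean indexing: keep only rows where Age is at least 30.
older = df[df["Age"] >= 30]

# Grouping and aggregation: mean age per city.
mean_age_by_city = df.groupby("City")["Age"].mean()

print(df["Age"].mean())
print(older)
print(mean_age_by_city)
```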

# Let's Go

In [1]:
import pandas as pd

In [2]:
pd.__version__

'2.1.1'

## Pandas Series 

A Pandas Series is a one-dimensional labeled array-like data structure in Python. It is a fundamental building block of Pandas and is designed to hold data of a single data type. Series objects are similar to one-dimensional NumPy arrays but come with additional indexing capabilities and powerful data manipulation tools.

In [5]:
# pandas Series

# We can create a pandas Series with the pd.Series() constructor.

# Series from a list

my_list = [1, 2, 3, 4, 5]
print("List: ", my_list)

print()

my_ser = pd.Series(my_list)
print("Series:")
print(my_ser)

List:  [1, 2, 3, 4, 5]

Series:
0    1
1    2
2    3
3    4
4    5
dtype: int64


In [6]:
type(my_ser)

pandas.core.series.Series

In a Series, each value is assigned a label. If no labels are specified, the values are labeled with their integer position: the first value has index 0, the second has index 1, and so on.
In the code above we did not specify any labels, so the default index counts up from 0.
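To make the default labeling concrete, here is a minimal sketch that inspects the index pandas generates when none is given. The values are arbitrary.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# With no explicit index, pandas creates a RangeIndex starting at 0.
print(s.index)        # RangeIndex(start=0, stop=3, step=1)
print(list(s.index))  # [0, 1, 2]

# The default labels can be used to look up values.
print(s.loc[0])       # 10
```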

In [7]:
# Create a Series with custom index labels
data = {'A': 10, 'B': 20, 'C': 30, 'D': 40, 'E': 50}
series = pd.Series(data)

print(series)

A    10
B    20
C    30
D    40
E    50
dtype: int64


In [9]:
# Create a Series with explicit index and data type
data = [1.1, 2.2, 3.3, 4.4, 5.5]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index, dtype=float)
print(series)

a    1.1
b    2.2
c    3.3
d    4.4
e    5.5
dtype: float64


We can access elements of a Series using the index label, perform operations on Series, and use them as part of more complex data manipulation tasks.

In [12]:
value = series['c']  # Access the element with index 'c'
print(value, end="\n\n")

subset = series[series > 3]  # Filter elements greater than 3
print(subset, end="\n\n")

sum_values = series + 2
print(sum_values)

3.3

c    3.3
d    4.4
e    5.5
dtype: float64

a    3.1
b    4.2
c    5.3
d    6.4
e    7.5
dtype: float64
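
Operations between two Series are matched by index label rather than by position. The sketch below uses two small, made-up Series whose labels only partly overlap; labels present in just one of them produce `NaN` in the result.

```python
import pandas as pd

# Two hypothetical Series with partly overlapping labels.
a = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
b = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on labels: only 'b' and 'c' exist in both Series.
total = a + b
print(total)
```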


## Pandas DataFrame

A Pandas DataFrame is a two-dimensional, size-mutable, and highly flexible data structure that is at the core of the Pandas library. It is similar to a spreadsheet or SQL table and is designed to store and manipulate data in a tabular format, where data is organized into rows and columns. DataFrames are a fundamental data structure for data analysis and manipulation in Python.

In [13]:
# DataFrames

# We can create a DataFrame with the pd.DataFrame() constructor.

# Create a DataFrame from a dictionary of columns
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)

In [14]:
df

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles


In [15]:
type(df)

pandas.core.frame.DataFrame
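
"Size-mutable" means that columns and rows can be added or removed after the DataFrame is created. The following is a minimal sketch of that idea; the `AgeNextYear` column and the extra row are hypothetical additions made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# Add a column derived from an existing one.
df["AgeNextYear"] = df["Age"] + 1

# Append a row (hypothetical values) by concatenating another DataFrame.
new_row = pd.DataFrame({"Name": ["Dana"], "Age": [40], "AgeNextYear": [41]})
df = pd.concat([df, new_row], ignore_index=True)

# Drop a column; drop() returns a new DataFrame by default.
df = df.drop(columns=["AgeNextYear"])

print(df.shape)  # (4, 2)
```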

In [21]:
# locating rows
# pandas uses the loc attribute to return one or more rows by label

row1 = df.loc[0]
print(row1)
print(type(row1))                 # a single label returns a Series

rows = df.loc[0:2]                # label slices include both endpoints
print(rows)
print(type(rows))                 # a slice returns a DataFrame

Name       Alice
Age           25
City    New York
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
<class 'pandas.core.frame.DataFrame'>

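The difference between label-based and integer-based selection is easiest to see side by side. This sketch rebuilds the same small DataFrame and compares `loc` (labels, end of slice included) with `iloc` (positions, end of slice excluded).

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "San Francisco", "Los Angeles"],
})

by_label = df.loc[0:1]      # label slice: rows 0 and 1 (end included)
by_position = df.iloc[0:1]  # position slice: row 0 only (end excluded)

# loc also accepts lists of row labels and column labels.
picked = df.loc[[0, 2], ["Name", "City"]]

print(len(by_label), len(by_position))
print(picked)
```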

We can also specify our own index labels when creating a DataFrame.

In [22]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 

      calories  duration
day1       420        50
day2       380        40
day3       390        45
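
With named index labels, rows can be selected by those names through `loc`. A short sketch using the same data:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A single label returns that row as a Series.
day2 = df.loc["day2"]
print(day2["calories"])  # 380

# A list of labels returns a DataFrame with just those rows.
subset = df.loc[["day1", "day3"]]
print(subset)
```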


In [23]:
# Reading Files as DataFrame

In [None]:
# CSV files:
# We can read a CSV file with the read_csv() function and store its contents as a DataFrame.

df = pd.read_csv("/path/to/file")

# read_csv() accepts many arguments. An important one is `sep`, which sets the
# character that separates fields in the file. For CSV files the default is a comma (',').
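
As a minimal sketch of the `sep` argument, the example below parses semicolon-separated text. The text is an invented stand-in for a file on disk, wrapped in `io.StringIO` so the example runs without any external file.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a real file.
raw = "name;age\nAlice;25\nBob;30\n"

# sep=";" tells read_csv to split fields on semicolons instead of commas.
df = pd.read_csv(io.StringIO(raw), sep=";")
print(df)
```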

In [28]:
# JSON files:
# To read JSON data, pandas provides read_json()

df = pd.read_json('data.json')
df

                                                          person
address        {'street': '123 Main Street', 'city': 'Anytown...
age                                                           30
email                                        johndoe@example.com
first_name                                                  John
is_student                                                 False
last_name                                                    Doe
phone_numbers  [{'type': 'home', 'number': '555-123-4567'}, {...
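
Nested JSON like this ends up with whole dictionaries stored inside cells. One possible way to flatten such data is `pd.json_normalize`. The sketch below is an assumption-laden illustration: the full contents of `data.json` are not shown above, so the record here is a hypothetical one with invented values and a similar shape.

```python
import pandas as pd

# Hypothetical record shaped like the output above; the values are invented
# because the real data.json is only partially visible.
record = {
    "person": {
        "first_name": "John",
        "last_name": "Doe",
        "age": 30,
        "address": {"street": "123 Main Street", "city": "Anytown"},
    }
}

# json_normalize expands nested dictionaries into dotted column names.
flat = pd.json_normalize(record["person"])
print(flat.columns.tolist())
print(flat)
```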


In [29]:
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
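
When a DataFrame is built from a dictionary of dictionaries, the outer keys become columns and the inner keys become index labels. In the data above the inner keys are strings, so the index holds `"0"`, `"1"`, and so on rather than integers. The sketch below, using a shortened copy of the data, shows one way to convert them.

```python
import pandas as pd

# Shortened version of the nested dictionary used above.
data = {
    "Duration": {"0": 60, "1": 60, "2": 45},
    "Pulse": {"0": 110, "1": 117, "2": 109},
}
df = pd.DataFrame(data)

# The inner keys are strings, so the index labels are strings too.
print(df.index.tolist())   # ['0', '1', '2']

# Convert the labels to integers so that df.loc[0] works.
df.index = df.index.astype(int)
print(df.loc[0, "Pulse"])  # 110
```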
