# Pandas

Pandas is a powerful data analysis library for Python. 

It provides two main data structures: 

 - **DataFrame** 
 - **Series**
 
A DataFrame is a two-dimensional, labeled table of rows and columns, and it is the key data structure for data analysis and manipulation in Python.

Pandas provides a wide range of functions and methods specifically designed for working with DataFrames. 

These functions and methods allow users to perform tasks such as data cleaning, filtering, grouping, merging, and statistical analysis efficiently and effectively. 

Pandas simplifies complex data operations.

Pandas is a valuable module for data scientists, analysts, and researchers working with tabular data.
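As a rough illustration of the kinds of tasks mentioned above, the sketch below filters, groups, and merges two small tables. The data, column names, and variable names (`sales`, `regions`, and so on) are made up for this example and are not part of the course material.

```python
import pandas as pd

# Hypothetical example data (invented for illustration)
sales = pd.DataFrame({
    "city": ["Budapest", "Vienna", "Budapest", "Vienna"],
    "amount": [100, 250, 300, 50],
})
regions = pd.DataFrame({
    "city": ["Budapest", "Vienna"],
    "country": ["Hungary", "Austria"],
})

# Filtering: keep only the rows where amount is at least 100
large_sales = sales[sales["amount"] >= 100]

# Grouping + statistics: total amount per city
totals = sales.groupby("city")["amount"].sum()

# Merging: attach the country to each sale via the shared "city" column
merged = sales.merge(regions, on="city")

print(large_sales)
print(totals)
print(merged)
```

Each step returns a new object, so the original `sales` table is left unchanged.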


In [None]:
import pandas as pd

In [None]:
data = {'Name':['Alice','Bob','Charlie'],
       'Age':[21,20,22]}

print(data)

In [None]:
# Creating a DataFrame

df = pd.DataFrame(data)
student = {"Name": ['Anna', 'Mark', 'Vanessa'], "Favorite Subjects": ["Theory", "Python","Data Structures"], 'Age': [21,22,19]}
df2 = pd.DataFrame(student)
df2

In [None]:
print(type(data))
print(type(df))

In [None]:
# accessing value by using key
print(data['Name'])

In [None]:
# accessing value by using key and list index.
print(data['Name'][0])

In [None]:
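# Accessing a column
# We use indexing with the column name to access columns.
# A notebook cell only displays its last expression, so we print both.
print(df['Name'])
print(df2["Favorite Subjects"])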

In [None]:
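# How do we access a single row value in a column?
# Select the column first, then the row label.
# A notebook cell only displays its last expression, so we print both.
print(df['Name'][0])
print(df2["Favorite Subjects"][1])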

In [None]:
filtered_data = df[df['Age']>21]

print(filtered_data)
print(filtered_data['Name'])

In [None]:
type(filtered_data)

###   <font color='red'> (Question 1 for compensating missing score from midterm:)</font>
Can you filter the names of people whose age is greater than 21 from the `data` dictionary?

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [20, 21, 22]
}


# Solution 1
# for i in range(len(data['Age'])):
#     if data['Age'][i] > 21:
#         print(data['Name'][i])

# Solution 2
# filtered = [each for i, each in enumerate(data['Name']) if data['Age'][i]>21]
# filtered
        
# Expected output: Charlie
for i in range(len(data["Age"])):
    if data["Age"][i] > 21:
        print(data["Name"][i])


        

In [None]:
#Performing calculations 

average_age = df['Age'].mean()
average_age2 = df2["Age"].mean()
median = df2["Age"].median()
print(average_age)
print(average_age2)
print(median)

###   <font color='red'> (Question 2 for compensating missing score from midterm:)</font>

Can you calculate the average age from this 'data' dictionary? 

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [21, 20, 22]
}

total = 0  # not named `sum`, which would shadow the built-in sum() used in Solution 3

# Solution 1
# for i in range(len(data['Age'])):
#     total += data['Age'][i]
#
# total / len(data['Age'])

# Solution 2
for age in data['Age']:
    total = total + age

print(total / len(data['Age']))

# Solution 3
# sum(data['Age']) / len(data['Age'])

# Expected output: 21.0


In [None]:
# Manipulating data

df['Age']=df['Age']*2
df2["Age"] = df2["Age"] + 2
df2

The `df` DataFrame used above has 2 columns, 'Name' and 'Age', each containing the corresponding data. You can perform various operations on a DataFrame, such as selecting specific columns, filtering rows, and applying mathematical operations to columns.

**DataFrames** are a fundamental tool for data analysis in Python, providing a convenient and efficient way to work with structured data.
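The three kinds of operations just mentioned (selecting columns, filtering rows, and arithmetic on columns) can be sketched in a few lines. The `people` DataFrame and the `AgeInMonths` column are names chosen only for this example.

```python
import pandas as pd

# Small example table, similar in shape to the one used in this notebook
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [21, 20, 22]})

# Selecting: single brackets give a Series, double brackets give a DataFrame
names = people["Name"]
name_and_age = people[["Name", "Age"]]

# Filtering: a boolean condition keeps only the matching rows
adults = people[people["Age"] >= 21]

# Arithmetic: operations apply to every value in the column at once
people["AgeInMonths"] = people["Age"] * 12

print(type(names))
print(type(name_and_age))
print(adults)
print(people)
```

Assigning to a new column name, as in the last step, adds a column instead of overwriting an existing one.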

## Series Data Structure in Pandas

In pandas, a Series is a one-dimensional labeled array that can hold any data type. It is a **fundamental data structure in pandas** and is similar to a **column in a spreadsheet or a single variable in statistics**. Series can be created from lists, arrays, dictionaries, or other Series objects.

Here's a summary of key points about pandas Series:

- One-Dimensional: Series is one-dimensional, meaning it consists of a single sequence of data values.

- Labeled: Each element in a Series has a label, which is its index. You can access elements using these labels.

- Flexible Data Types: A Series can hold data of any data type, including integers, floats, strings, and objects. Each Series has a single dtype; values of mixed types are stored with the `object` dtype.

- Creation: Series can be created from various data structures such as lists, arrays, dictionaries, or even other Series objects.

- Operations: You can perform various operations on Series, including mathematical operations, slicing, indexing, and more.
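To make the "labeled" and "operations" points concrete, here is a minimal sketch that accesses the same Series by label (`.loc`) and by position (`.iloc`). The `temps` Series and its weekday labels are invented for the example.

```python
import pandas as pd

# A Series with string labels instead of the default 0, 1, 2, ...
temps = pd.Series([18.5, 21.0, 19.2], index=["Mon", "Tue", "Wed"])

by_label = temps.loc["Tue"]    # access by index label
by_position = temps.iloc[0]    # access by integer position
first_two = temps.iloc[:2]     # slicing by position
doubled = temps * 2            # arithmetic applies element-wise

print(by_label)
print(by_position)
print(first_two)
print(doubled)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a number means a label or a position.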


In [None]:
import pandas as pd

# Creating a Series from a list
data = [1, 2, 3, 4, 5]

series = pd.Series(data)
type(series)

In [None]:
print(type(data))
print(type(series))

In [None]:
# Accessing elements using index in list datastructure
print(data[0])

# Accessing elements using index in series data structure
print(series[0])  

In [None]:
# Performing mathematical operations
result = series * 2
print(result)
print(series) # Not modified

In [None]:
# Creating a Series from a dictionary
data_dict = {'a': 1, 'b': 2, 'c': 3}

series_from_dict = pd.Series(data_dict)
print(series_from_dict)
# Notice that the index labels are now the dictionary keys, not integer positions

student2 = {'Anna': 21, "Jane": 23, "Vanessa": 24}
s2 = pd.Series(student2)
s2


In [None]:
# Creating a Series with custom index

data = [1, 2, 3, 4, 5]
custom_index = ['A', 'B', 'C', 'D', 'E']

series_with_custom_index = pd.Series(data, index=custom_index)

print(series_with_custom_index)

**Note:** Understanding Series is essential for working effectively with pandas, especially when dealing with one-dimensional data or individual columns in a DataFrame.
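The link between the two structures can be checked directly: selecting one column of a DataFrame returns a Series, and so does selecting one row. The sketch below uses a small `df_demo` table made up for this purpose.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [21, 20]})

# A single column is a Series that shares the DataFrame's row index
age_column = df_demo["Age"]
print(type(age_column))
print(age_column.index.tolist())

# A single row is also a Series; its index is the column names
first_row = df_demo.iloc[0]
print(type(first_row))
print(first_row.index.tolist())
```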

In [None]:
# Create series of cars
cars = pd.Series(['BMW', 'Toyota', 'Opel'])

#Create series of colors
colors = pd.Series(['red', 'white', 'black'])

# Creating a series with a custom index
s3 = pd.Series(['BMW', 'Toyota', 'Opel'], index=colors)
print(s3)

# Combine these Series into a DataFrame.

car_df = pd.DataFrame({'Cars': cars, 'colors': colors})

# print data frame
# print(car_df)


###   <font color='red'> (Question 3 for compensating missing score from midterm:)</font>

- Make a list of different foods.
- Make a list of different forint values (these can be integers).
- Combine your lists of foods and forint values into a DataFrame.


Note: Make sure your two lists are the same size before combining them in a DataFrame.
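Why the sizes matter can be seen in a short experiment. The sketch below uses made-up `letters` and `numbers` data, not the exercise data: plain lists of different lengths raise a `ValueError`, while Series of different lengths are aligned on their index and the gap is filled with `NaN`.

```python
import pandas as pd

letters = ["a", "b", "c"]
numbers = [1, 2]  # one element short on purpose

# Lists of different lengths cannot be combined
try:
    pd.DataFrame({"Letters": letters, "Numbers": numbers})
    error_raised = False
except ValueError as err:
    error_raised = True
    print("ValueError:", err)

# Series are aligned by index, so the missing value becomes NaN
aligned = pd.DataFrame({
    "Letters": pd.Series(letters),
    "Numbers": pd.Series(numbers),
})
print(aligned)
```

Either way, mismatched sizes usually point to a mistake in the input data, so it is worth checking the lengths first.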

In [None]:
foods = ["fries", "steak", "rice"]
forints = [12,45,67]
frame = pd.DataFrame({"Foods": foods, "Forints": forints})
print(frame)

###   <font color='red'> (Question 4 for compensating missing score from midterm:)</font>
Can you add your name and age to `df` by using the Series data structure?

In [49]:
import pandas as pd
data = {'Name':['Alice','Bob','Charlie'],
       'Age':[21,20,22]}

df = pd.DataFrame(data)


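new_row = pd.Series({"Name": "Collins", "Age": 21})
# df["Subjects"] = ["DSA", "Analysis", "Python"] # Add a column

# DataFrame.append() was removed in pandas 2.0; calling it raises
# "AttributeError: 'DataFrame' object has no attribute 'append'".
# Instead, wrap the Series in a one-row DataFrame and use pd.concat().
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df)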

In [None]:
print(df)

## 1. Importing/Loading data with pandas

So far we have been creating `Series` and `DataFrame`s from scratch.

What you will usually be doing instead is importing your data in the form of a `.csv` (comma-separated values) or spreadsheet file.

CSV is lightweight and compatible with many different applications.

Pandas allows for easy importing of data like this through functions such as `pd.read_csv()` and `pd.read_excel()` (for Microsoft Excel files).

These functions read the file and convert its contents into a pandas `DataFrame`.

Data structured as a **DataFrame** is easy to manipulate and analyze.

Having your data available in a `DataFrame` allows you to take advantage of all of pandas' functionality on it.

Another common practice you'll see is data being imported into a `DataFrame` called `df` (short for `DataFrame`).
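If you do not have a CSV file at hand, `pd.read_csv()` can also read from an in-memory text buffer, which makes it easy to try out. In the sketch below the CSV content, its column names, and the variable `cars_demo` are hypothetical stand-ins, not the actual contents of the course's `car_sales.csv` file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """Make,Colour,Price
Toyota,White,4000
Honda,Red,5000
BMW,Black,22000
"""

# read_csv accepts a file path or a file-like object such as StringIO
cars_demo = pd.read_csv(io.StringIO(csv_text))

print(cars_demo)
print(cars_demo.dtypes)
```

The first line of the CSV becomes the column names, and pandas infers a data type for each column.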



In [None]:
# Import the car sales data; read_csv takes the file path as a string
car_sales_df = pd.read_csv('/home/gerel/Documents/DATA/car_sales.csv')

In [None]:
type(car_sales_df)

In [None]:
car_sales_df

## 2. Exploring data

One of the first things you'll want to do after you import some data into a pandas `DataFrame` is to start exploring it.

pandas has many built-in functions which allow you to quickly get information about a `DataFrame`.

Let's explore some using the `car_sales_df` `DataFrame`.

In [None]:
car_sales_df

In [None]:
# .dtypes attribute shows us what datatype each column contains.
car_sales_df.dtypes


In a pandas DataFrame, the `object` data type is used for columns that do not have one specific numeric type, such as columns of strings or columns that mix strings, numbers, and other Python objects.
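A quick way to see this is to build a small table with a purely numeric column and a column that mixes types. The `mixed_demo` table below is invented for the example. Note that, depending on the pandas version, a column containing only strings may be reported either as `object` or as a dedicated string dtype, so the sketch only relies on the mixed column.

```python
import pandas as pd

mixed_demo = pd.DataFrame({
    "label": ["a", "b", "c"],     # strings only
    "value": [1, 2, 3],           # integers only
    "mixed": [1, "two", 3.0],     # integers, strings, and floats together
})

# One dtype per column; the mixed column falls back to object
print(mixed_demo.dtypes)

is_mixed_object = mixed_demo["mixed"].dtype == object
print(is_mixed_object)
```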

`.info()` shows a handful of useful information about a `DataFrame` such as: 
* How many entries (rows) there are 
* Whether there are missing values (if a column's non-null count is less than the number of entries, it has missing values)
* The datatypes of each column

In [None]:
car_sales_df.info()

You can also call various statistical and mathematical methods such as `.mean()` or `.sum()` directly on a `DataFrame` or `Series`.
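As a small sketch of both cases, the example below calls `.mean()` on a whole DataFrame and `.sum()` on a single column. The `scores` table is made up for illustration. The `numeric_only=True` argument tells pandas to skip non-numeric columns such as the student names.

```python
import pandas as pd

scores = pd.DataFrame({
    "Student": ["Anna", "Mark", "Vanessa"],
    "Math": [80, 90, 70],
    "Physics": [60, 75, 90],
})

# On a DataFrame: one result per numeric column, returned as a Series
column_means = scores.mean(numeric_only=True)
print(column_means)

# On a Series (one column): a single number
math_total = scores["Math"].sum()
print(math_total)
```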

In [None]:
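# Invoking the mean() method on a DataFrame
# numeric_only=True skips non-numeric (e.g. object) columns, which would
# otherwise raise a TypeError in pandas 2.0 and later
car_sales_df.mean(numeric_only=True)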

In [None]:
car_sales_df.sum()

In [None]:
car_prices = car_sales_df['Price']
car_prices

In [None]:
type(car_prices)

In [None]:
car_prices.sum()

In [None]:
car_price = pd.Series(car_sales_df['Price'])
# car_prices = car_sales_df['Price']

In [None]:
car_price.sum()



`.columns` will show you all the columns of a `DataFrame`.

In [None]:
car_sales_df.columns

You can save them to a variable for later use. The result is a pandas `Index` object, which supports list-style indexing.

In [None]:
car_columns = car_sales_df.columns
car_columns[0]
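If a plain Python list is needed, the `Index` returned by `.columns` can be converted with `.tolist()`. The sketch below uses a tiny `columns_demo` table with assumed column names, purely for illustration.

```python
import pandas as pd

columns_demo = pd.DataFrame({"Make": ["Toyota"], "Price": [4000]})

cols = columns_demo.columns             # a pandas Index object
col_list = columns_demo.columns.tolist()  # a regular Python list

print(type(cols))
print(type(col_list))
print(col_list)
```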

`.index` will show you the values in a `DataFrame`'s index (the column on the far left).

In [None]:
car_sales_df.index

In [None]:
# show length of data frame
len(car_sales_df)

So even though the length of our `car_sales_df` DataFrame is 10, the index labels go from 0 to 9, because the default index starts counting at 0.
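The same relationship can be reproduced with any 10-row table. The sketch below builds a stand-in `index_demo` DataFrame (not the car sales data) and compares its length with the first and last labels of its default index.

```python
import pandas as pd

# A stand-in DataFrame with 10 rows and the default index
index_demo = pd.DataFrame({"value": range(10)})

n_rows = len(index_demo)
first_label = index_demo.index[0]
last_label = index_demo.index[-1]

print(n_rows)
print(index_demo.index)
print(first_label, last_label)
```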