# Pandas for Beginners

Pandas is a Python library used for working with data sets. It lets you analyze large data sets and draw conclusions based on statistical theory. Pandas can also clean messy data sets and make them readable and relevant. Some of its main functions are analyzing, cleaning, exploring and manipulating data, which makes Pandas an important tool in data science.

## Importing Pandas

### Pandas source code on GitHub
https://github.com/pandas-dev/pandas


In [1]:
# importing pandas
import pandas
pandas.__version__

'1.5.3'

Pandas is usually imported under the **pd** alias.
After that, the Pandas package can be referred to as pd instead of pandas.

In [5]:
# using "pd" alias to import pandas

import pandas as pd

dataset = {
  'food': ["sushi", "pasta", "chips"],
  'price': [10, 14, 6]
}

a = pd.DataFrame(dataset)

print(a)


    food  price
0  sushi     10
1  pasta     14
2  chips      6


## Loading Data with Pandas

### CSV Files
Let's say I have a CSV file that I want to load using the Pandas built-in function **read_csv**.

In [None]:
import pandas
csv_path = 'file1.csv'
df = pandas.read_csv(csv_path)

The same statement can be shortened by importing Pandas under its standard abbreviation, **pd**.

In [None]:
import pandas as pd
csv_path = 'file1.csv'        # stores the path of the csv which can be used as an argument
df = pd.read_csv(csv_path)    # stored in the variable df
df.head()                     # to examine the first rows of the data frame
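
Since file1.csv is not included here, the following is a minimal sketch that reads the same kind of data from an in-memory string instead of a file. The CSV text and its column names are made up for illustration; **read_csv** accepts a file-like object as well as a path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for file1.csv
csv_text = "food,price\nsushi,10\npasta,14\nchips,6\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first two rows only
print(df.shape)     # (number of rows, number of columns)
```

Calling head() without an argument returns the first five rows, which is usually enough to check that the file was parsed as expected.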

The process for loading an Excel file is similar. I pass the path of the Excel file to the **read_excel** function, and the result is again a data frame.

In [None]:
import pandas as pd
xlsx_path = 'file1.xlsx'          # stores the path of the Excel file
df = pd.read_excel(xlsx_path)     # the result is stored in the variable df
df.head()

In [40]:
# sample 1

import pandas as pd
songs = {"Album":["Thriller", "The Body Guard", "Back in Black"],
         "Released":[1982, 1992, 1980],
}
songs_frame = pd.DataFrame(songs)
print(songs_frame)

            Album  Released
0        Thriller      1982
1  The Body Guard      1992
2   Back in Black      1980


## Data Frames

A DataFrame has rows and columns. It is often made from a dictionary, where the keys become the column labels and each list of values fills that column, one value per row. A dictionary is converted to a DataFrame with the **pd.DataFrame()** function.

In [None]:
# sample 2

import pandas as pd
  
grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
                "items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
                "store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"]
}

grocery_frame = pd.DataFrame(grocery_table)
print(grocery_frame)
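
To make the mapping from dictionary to table concrete, here is a small sketch with a shortened, made-up version of the grocery dictionary. It only inspects the resulting frame: the column labels, the shape, and the default row labels.

```python
import pandas as pd

# Small hypothetical dictionary: each key holds one column's values
table = {"days": ["Monday", "Tuesday"], "store": ["ICA", "LIDL"]}
frame = pd.DataFrame(table)

print(list(frame.columns))  # column labels come from the dictionary keys
print(frame.shape)          # (rows, columns)
print(list(frame.index))    # default row labels start at 0
```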

### Selecting columns from a data frame

To get a new DataFrame with just one column, put the column header in double brackets after the DataFrame name: variable = dataframe[["column name"]]

The same goes for multiple columns: list all the column headers inside the double brackets to make a new DataFrame with those columns.


In [None]:

import pandas as pd

grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
                "items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
                "store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
                "location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"]
}
grocery_frame = pd.DataFrame(grocery_table)
x = grocery_frame[["location"]]    # select the "location" column as a new one-column DataFrame
print(x)
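
The difference between single and double brackets is easy to miss, so here is a short sketch on a reduced, made-up grocery table. Single brackets return a Series, double brackets return a DataFrame. The last step shows, for contrast, how a column is actually added by assignment; the "price" values are invented.

```python
import pandas as pd

frame = pd.DataFrame({
    "items": ["pizza kit", "wagyu"],
    "store": ["ICA", "ICA"],
    "location": ["Nacka", "Nacka"],
})

one_col_series = frame["location"]      # single brackets -> Series
one_col_frame = frame[["location"]]     # double brackets -> DataFrame
two_cols = frame[["items", "store"]]    # several columns -> DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(list(two_cols.columns))

# Assigning to a new column label adds a column to the frame
frame["price"] = [130, 4000]
print(list(frame.columns))
```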


### Working with and saving data from data frame

The **unique()** method returns the unique values of a pandas Series, such as a single DataFrame column, in the order in which they first appear.

In [42]:
import pandas as pd

grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
                "items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
                "store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
                "location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"],
                "price":[130, 1000, 250, 70, 4000, 120, 40]
}
grocery_frame = pd.DataFrame(grocery_table)    # build the data frame from the dictionary
print(grocery_frame["location"].unique())      # all unique elements in the "location" column


['Nacka' 'Hässleby' 'TC' 'Täby' 'Ropsten']
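
As a small follow-up sketch, two related Series methods answer the questions that usually come next: how many distinct values there are (**nunique()**) and how often each one occurs (**value_counts()**). The store names are taken from the table above.

```python
import pandas as pd

stores = pd.Series(["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"])

print(stores.unique())        # distinct values, in order of first appearance
print(stores.nunique())       # how many distinct values there are
print(stores.value_counts())  # how often each value occurs
```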


#### to_csv() method
Here I create a DataFrame called grocery_frame from the dictionary grocery_table. I then print a boolean Series showing, for each row, whether the value in the "price" column is greater than or equal to 1000. Next, I filter grocery_frame down to the rows with a price of 1000 or more and assign the result to a new DataFrame called df1. Finally, I use the **to_csv()** method to save the contents of df1 to a CSV file named "items_over_1000.csv".

In [41]:
import pandas as pd

grocery_table = {"days":["Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"],
                "items":["pizza kit","tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"],
                "store":["ICA", "LIDL", "Hemköp", "Willys", "ICA", "COOP", "Hemköp"],
                "location":["Nacka", "Hässleby", "TC", "Täby", "Nacka", "TC", "Ropsten"],
                "price":[130, 1000, 250, 70, 4000, 120, 40]
}
grocery_frame = pd.DataFrame(grocery_table)    # name of the data frame
print(grocery_frame["price"] >= 1000)                 # Boolean Series: True where the price is 1000 or more

df1 = grocery_frame[grocery_frame["price"] >= 1000]   # keep only the rows where the condition is True
df1.to_csv("items_over_1000.csv")                     # save the filtered rows to a CSV file


0    False
1     True
2    False
3    False
4     True
5    False
6    False
Name: price, dtype: bool
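
The sketch below repeats the filter-and-save steps on a shortened, made-up table without touching the disk. When **to_csv()** is called without a path it returns the CSV text, which can be read back to check the result. Passing index=False is a choice made for this sketch; it leaves the row labels out of the file.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "items": ["pizza kit", "tuna and salmon", "wagyu"],
    "price": [130, 1000, 4000],
})

mask = frame["price"] >= 1000      # boolean Series, one value per row
expensive = frame[mask]            # keeps only rows where mask is True

# With no path, to_csv returns the CSV text instead of writing a file
csv_text = expensive.to_csv(index=False)
print(csv_text)

# Reading the text back gives the same rows
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```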


## Series and Labels
A **Series** is like a column in a table. It is a one-dimensional array that can hold data of any type.

**Labels** can be set with the *index* argument.

### Lists
Example: Create a weekly budget table as a Pandas Series built from lists.


In [7]:
import pandas as pd

budget_list = [33, 45, 25, 10, 90, 23, 4]

sample_a = pd.Series(budget_list, index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])

# Using string formatting to add the dollar sign
sample_a_with_dollar = sample_a.map("${:.2f}".format)

print(sample_a_with_dollar)

Monday       $33.00
Tuesday      $45.00
Wednesday    $25.00
Thursday     $10.00
Friday       $90.00
Saturday     $23.00
Sunday        $4.00
dtype: object
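
Once a Series has labels, values can be looked up by label rather than by position. The following sketch uses the same budget numbers as above but keeps them numeric. Formatting the values as strings, as in the previous cell, turns the dtype into object, so any arithmetic is best done before formatting.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
budget = pd.Series([33, 45, 25, 10, 90, 23, 4], index=days)

print(budget["Friday"])                 # one value by label
print(budget[["Saturday", "Sunday"]])   # several labels -> smaller Series
print(budget.sum())                     # numeric operations still work
```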


### Dictionaries
When a Series is created from a dictionary, the **keys** of the dictionary become the labels.

Example: Create a weekly budget table as a Pandas Series built from a dictionary.

If I have a dictionary and I want to apply formatting to its values, I can achieve this by iterating over the dictionary and formatting each value individually.

In [9]:
import pandas as pd
budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10, "Friday": 90, "Saturday": 23, "Sunday":4}

# Using a dictionary comprehension to format each value with a dollar sign
budget_with_dollar = {day: "${:.2f}".format(value) for day, value in budget_dict.items()}

print(budget_with_dollar)

{'Monday': '$33.00', 'Tuesday': '$45.00', 'Wednesday': '$25.00', 'Thursday': '$10.00', 'Friday': '$90.00', 'Saturday': '$23.00', 'Sunday': '$4.00'}


Use the **index** argument when creating a Series from a dictionary to keep only specific items.

In [11]:
import pandas as pd
budget_dict = {"Monday": 33, "Tuesday": 45, "Wednesday": 25, "Thursday": 10, "Friday": 90, "Saturday": 23, "Sunday":4}

# Using a dictionary comprehension to format each value with a dollar sign
budget_with_dollar = {day: "${:.2f}".format(value) for day, value in budget_dict.items()}

budget_specific_days = pd.Series(budget_with_dollar, index = ["Saturday", "Sunday"])

print(budget_specific_days)

Saturday    $23.00
Sunday       $4.00
dtype: object
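
As a hedged illustration of how **index** interacts with the dictionary keys, the sketch below passes a shortened, numeric version of the budget dictionary straight to pd.Series. A label that is not among the keys does not raise an error; its value is simply missing (NaN).

```python
import pandas as pd

budget_dict = {"Monday": 33, "Tuesday": 45, "Saturday": 23, "Sunday": 4}

# Keys become labels; index keeps only the requested ones
weekend = pd.Series(budget_dict, index=["Saturday", "Sunday"])
print(weekend)

# "Friday" is not a key, so its value is NaN (missing)
with_missing = pd.Series(budget_dict, index=["Sunday", "Friday"])
print(with_missing.isna())
```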


## Data Frames
A Data Frame is a two-dimensional table. A Data Frame represents a whole table, while a Series is like a single column.

**Syntax:** variable = pd.DataFrame(data)

### Data Frame using lists
Example: Upgrade my budget table by adding another Series for the food bought, first using lists and then using dictionaries.

In [25]:
# Data Frame using lists

import pandas as pd

# create lists for each category 
expenses = [33, 45, 25, 10, 90, 23, 4]
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
items = ["pizza kit", "tuna and salmon", "vegetables", "pre-made dinner", "wagyu", "greek salad", "oatmeal"]


# Create pandas Series (column) for items
items_series = pd.Series(items, index = days)

# Create pandas Series(column) for expenses
expenses_series = pd.Series(expenses, index = days)

# Combine the two Series into a DataFrame (rows are matched on the shared index)
budget_df = pd.DataFrame({"Budget" : expenses_series, "Items" : items_series})

print(budget_df)

           Budget            Items
Monday         33        pizza kit
Tuesday        45  tuna and salmon
Wednesday      25       vegetables
Thursday       10  pre-made dinner
Friday         90            wagyu
Saturday       23      greek salad
Sunday          4          oatmeal


### Data Frame using dictionaries
Example: Upgrade my budget table by creating a DataFrame from dictionaries. I just need to provide a dictionary where the keys are the column names and the values are the data for each column.

In [36]:
# DataFrames using Dictionaries
# start with this simple code as a guide

import pandas as pd

data = {
    "numbers" : [1, 2, 3],         #int
    "letters" : ["a", "b" ,"c"]    # str
}

# load data into a DataFrame object
sample = pd.DataFrame(data)
# note: this prints the original dictionary; use print(sample) to display the DataFrame
print(data)

{'numbers': [1, 2, 3], 'letters': ['a', 'b', 'c']}


In [37]:
import pandas as pd

# create dictionary day:expenses pair
expenses_dict = {
    "Monday":33,
    "Tuesday":45,
    "Wednesday":25,
    "Thursday":10,
    "Friday":90,
    "Saturday":23,
    "Sunday":4
}


# create dictionary day:items pair
items_dict = {
    "Monday":"pizza kit",
    "Tuesday":"tuna and salmon",
    "Wednesday":"vegetables",
    "Thursday":"pre-made dinner",
    "Friday":"wagyu",
    "Saturday":"greek salad",
    "Sunday":"oatmeal"
}

# create pandas series (columns) from dictionaries
expenses_series = pd.Series(expenses_dict)
items_series = pd.Series(items_dict)

# create DataFrame by combining the 2 series (rows are matched on the day labels)
budget_df = pd.DataFrame({"Budget": expenses_series, "Items":items_series})

print(budget_df)


           Budget            Items
Monday         33        pizza kit
Tuesday        45  tuna and salmon
Wednesday      25       vegetables
Thursday       10  pre-made dinner
Friday         90            wagyu
Saturday       23      greek salad
Sunday          4          oatmeal
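
Because the two Series are matched on their labels, rows can be looked up by day, and a day that is missing from one Series shows up as NaN rather than shifting the data. The sketch below uses a shortened version of the budget data and deliberately leaves Wednesday out of the items to show this.

```python
import pandas as pd

expenses = pd.Series({"Monday": 33, "Tuesday": 45, "Wednesday": 25})
items = pd.Series({"Monday": "pizza kit", "Tuesday": "tuna and salmon"})  # no Wednesday

budget_df = pd.DataFrame({"Budget": expenses, "Items": items})

print(budget_df.loc["Tuesday"])            # one row, selected by its label
print(budget_df.loc["Monday", "Budget"])   # one cell: row label, column label
print(budget_df["Items"].isna())           # Wednesday has no item, so NaN
```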
