# <center> DSC 350 - Week 5 - Exercise 5
***
## Alana D'Agostino
### Professor Kinney
Textbook Reference: __[Hands-On Data Analysis with Pandas (2nd Ed.) - Ch. 3](https://github.com/AlanaDAg/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_03)__ <br>
Textbook Data Directory: __[Chapter 3 Exercises Data Directory (GitHub)](https://github.com/AlanaDAg/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_03/exercises)__
***

In [2]:
# Code attribution
/
# ========================================================================================
# Title: "Hands-On Data Analysis with Pandas (Second Edition)," Chapter 3
# Author: Stefanie Molin
# Date: 14 April 2024
# Modified By: Alana D'Agostino (DSC 350 - Week 5 - Exercise 5)
# Description: This program follows along with the exercises in Chapter 3 of
# Stefanie Molin's Hands-On Data Analysis with Pandas (2nd Ed.).
# It performs data manipulation, including reading in and combining multiple files
# of one type (CSV), performing basic data cleaning tasks, and reshaping DataFrames
# with pandas (melt()).
# Data is pulled from the textbook's GitHub Ch 03 directory (under 'exercises').
# ========================================================================================
/


# <center> <font color=blue>Chapter 3</font> <br><font color=mediumblue>**Data Wrangling with Pandas**


***
### <center><font color=#00ad43>**Complete the following exercises using what we have learned so far in this book and the data in the `exercises/` directory.**
***

# <font color=mediumblue>1.</font>**Multiple CSV Files:**<br><center>Reading In and Combining

> <font color=deeppink>We want to look at data for the **Facebook, Apple, Amazon, Netflix,** and **Google** (**FAANG**) stocks, but we were given each as a separate CSV file.</font>
> > <font color=deeppink>**Combine** the CSV files into a single file.
> > <font color=deeppink>Then **store** the dataframe of the FAANG data as `faang` for the rest of the exercises:
> > > * <font color=deeppink>**Read in** the `aapl.csv`, `amzn.csv`, `fb.csv`, `goog.csv`, and `nflx.csv` files
> > > * <font color=deeppink>**Add a column** to each dataframe, called `ticker`, indicating the ticker symbol it is for; this is how you look up a stock. *In this case, the filenames happen to be the ticker symbols.*
> > > * <font color=deeppink>**Append** them together into a single dataframe.
> > > * <font color=deeppink>**Save** the result in a CSV file called `faang.csv`

In [3]:
# Import libraries
import pandas as pd

# Combine the separate CSV files into a single DataFrame `faang`
## Read each file in a for-loop and collect the DataFrames in a list
frames = []
for ticker in ['fb', 'aapl', 'amzn', 'nflx', 'goog']:
    # Raw and f-string prefixes can be combined (rf'...') so backslashes stay literal
    filepath = rf'C:\Users\alana\OneDrive\Desktop\DSC 350\Data\{ticker}.csv'
    df_faang = pd.read_csv(filepath)
    # Add the `ticker` column as the first column of each DataFrame
    df_faang.insert(0, 'ticker', ticker.upper())
    frames.append(df_faang)

# DataFrame.append() was removed in pandas 2.0; concatenate once with pd.concat()
faang = pd.concat(frames)

# Save the new combined DataFrame as a CSV file
faang.to_csv('faang.csv', index=False)

# Inspect the first rows of new faang DataFrame
faang.head(5)

| | ticker | date | high | low | open | close | volume |
|---|---|---|---|---|---|---|---|
| 0 | FB | 2018-01-02 | 181.580002 | 177.550003 | 177.679993 | 181.419998 | 18151900.0 |
| 1 | FB | 2018-01-03 | 184.779999 | 181.330002 | 181.880005 | 184.669998 | 16886600.0 |
| 2 | FB | 2018-01-04 | 186.210007 | 184.100006 | 184.899994 | 184.330002 | 13880900.0 |
| 3 | FB | 2018-01-05 | 186.899994 | 184.929993 | 185.589996 | 186.850006 | 13574500.0 |
| 4 | FB | 2018-01-08 | 188.899994 | 186.330002 | 187.199997 | 188.279999 | 17994700.0 |


> **Sources:**
> * Combining raw and f-string literal: __[Stack Overflow](https://stackoverflow.com/questions/58302531/combine-f-string-and-raw-string-literal)__<br>
> * pandas `pd.concat()`: __[pydata.org](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)__ & __[GeeksforGeeks](https://www.geeksforgeeks.org/pandas-concat-function-in-python/)__
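
The read, tag, and combine steps from this exercise can be sketched on in-memory data, so the example runs without the chapter's CSV files. The two small CSV strings and the names `csv_text`, `frames`, and `combined` are invented for illustration. `ignore_index=True` is an optional variation that renumbers the rows; the notebook's own code keeps each file's original row index.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two ticker files (made-up values, not the book's data)
csv_text = {
    'fb': "date,close,volume\n2018-01-02,181.0,100\n2018-01-03,184.0,200\n",
    'aapl': "date,close,volume\n2018-01-02,43.0,300\n2018-01-03,43.1,400\n",
}

frames = []
for ticker, text in csv_text.items():
    # io.StringIO lets read_csv treat a string like a file
    df = pd.read_csv(io.StringIO(text))
    # Tag every row with the upper-cased ticker symbol
    df.insert(0, 'ticker', ticker.upper())
    frames.append(df)

# One concat call over the whole list; ignore_index=True renumbers rows 0..n-1
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the pieces in a list and concatenating once avoids rebuilding the growing DataFrame on every pass through the loop.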

***
# <font color=mediumblue>2.</font> **Data Cleaning:** <br><center>Data Type Conversions
> <font color=deeppink>With `faang`, use **type conversion** to:
> > * <font color=deeppink>**Cast** the values of the `date` column into ***datetimes***
> > * <font color=deeppink>**Cast** the values of the `volume` column into ***integers***

In [4]:
# Inspect data types for each feature (column)
print("Feature column data types:\n", faang.dtypes)

print("\nCast `date` column from object to datetimes." + 
      "\nCast `volume` from float64 to integers.")

Feature column data types:
 ticker     object
date       object
high      float64
low       float64
open      float64
close     float64
volume    float64
dtype: object

Cast `date` column from object to datetimes.
Cast `volume` from float64 to integers.


In [5]:
# Convert the `date` and `volume` columns, then sort the rows
faang = faang.assign(
    # Lambda functions (Ch.3, pg.144)
    # pd.to_datetime() function (Ch.3, pg.141, Molin)
    date=lambda x: pd.to_datetime(x.date),
    # .astype() method converts a single column at a time (Ch.3, pg.144, Molin)
    volume=lambda x: x.volume.astype(int)
# Sort by `date` and `ticker` so the first row (index 0) of each stock appears at the top
).sort_values(['date', 'ticker'])

# Inspect the first rows of the dataframe
faang.head(5)

| | ticker | date | high | low | open | close | volume |
|---|---|---|---|---|---|---|---|
| 0 | AAPL | 2018-01-02 | 43.075001 | 42.314999 | 42.540001 | 43.064999 | 102223600 |
| 0 | AMZN | 2018-01-02 | 1190.0 | 1170.51001 | 1172.0 | 1189.01001 | 2694500 |
| 0 | FB | 2018-01-02 | 181.580002 | 177.550003 | 177.679993 | 181.419998 | 18151900 |
| 0 | GOOG | 2018-01-02 | 1066.939941 | 1045.22998 | 1048.339966 | 1065.0 | 1237600 |
| 0 | NFLX | 2018-01-02 | 201.649994 | 195.419998 | 196.100006 | 201.070007 | 10966900 |


In [6]:
# Confirm the data type conversions
print("Data types after conversion:\n", faang.dtypes)

Data types after conversion:
 ticker            object
date      datetime64[ns]
high             float64
low              float64
open             float64
close            float64
volume             int32
dtype: object


> **NOTE:** Type conversion (Ch.3, pgs. 140-146, Molin)
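
As a small, self-contained sketch of the same two conversions: the `int32` shown in the output above is what `astype(int)` produced in this environment, and the width of plain `int` can differ by platform and NumPy version, so the example below requests `'int64'` explicitly. The names `df` and `converted` are illustrative only; the two volume values are copied from the FB rows shown earlier.

```python
import pandas as pd

# Two rows in the same "before" state as faang: date as text, volume as float
df = pd.DataFrame({
    'date': ['2018-01-02', '2018-01-03'],
    'volume': [18151900.0, 16886600.0],
})

converted = df.assign(
    # Parse the text dates into datetime values
    date=lambda x: pd.to_datetime(x['date']),
    # Ask for a fixed-width integer so the result is the same on every platform
    volume=lambda x: x['volume'].astype('int64'),
)
print(converted.dtypes)
```

Note that `astype()` to an integer type raises an error if the column contains missing values, so any `NaN` entries would need to be handled first.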

***
# <font color=mediumblue>3.</font> **Sorting:** <br><center>Subsets of Values
> <font color=deeppink>Find the **seven** (**7**) rows in `faang` with the lowest value for `volume`.

In [7]:
# Use nsmallest() method to grab n smallest rows (Ch.3, pg.148, Molin)
faang.nsmallest(7, 'volume')

| | ticker | date | high | low | open | close | volume |
|---|---|---|---|---|---|---|---|
| 126 | GOOG | 2018-07-03 | 1135.819946 | 1100.02002 | 1135.819946 | 1102.890015 | 679000 |
| 226 | GOOG | 2018-11-23 | 1037.589966 | 1022.398987 | 1030.0 | 1023.880005 | 691500 |
| 99 | GOOG | 2018-05-24 | 1080.469971 | 1066.150024 | 1079.0 | 1079.23999 | 766800 |
| 130 | GOOG | 2018-07-10 | 1159.589966 | 1149.589966 | 1156.97998 | 1152.839966 | 798400 |
| 152 | GOOG | 2018-08-09 | 1255.541992 | 1246.01001 | 1249.900024 | 1249.099976 | 848600 |
| 159 | GOOG | 2018-08-20 | 1211.0 | 1194.625977 | 1205.02002 | 1207.77002 | 870800 |
| 161 | GOOG | 2018-08-22 | 1211.839966 | 1199.0 | 1200.0 | 1207.329956 | 887400 |


> **NOTE:** The `nsmallest()` method returns the *n* rows with the smallest values in a specified column (or columns), ordered from smallest to largest.
> > The smallest values appear at the top of the returned DataFrame.

> The `nlargest()` method works the same way, but returns the *n* rows with the largest values, ordered from largest to smallest.

> **Syntax:** <center>`nsmallest()` = *dataframe*.nsmallest(*n*, *columns*, *keep*)<br>
`nlargest()` = *dataframe*.nlargest(*n*, *columns*, *keep*)

**Sources:**
* __[nsmallest()](https://www.w3schools.com/python/pandas/ref_df_nsmallest.asp)__ & __[nlargest()](https://www.w3schools.com/python/pandas/ref_df_nlargest.asp)__ - W3Schools
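
A minimal sketch of how `nsmallest()` relates to sorting, using a made-up five-row frame (the names `df`, `smallest`, `via_sort`, and `with_ties` are illustrative). It also shows the `keep` argument, which controls what happens when several rows tie on the value.

```python
import pandas as pd

# Made-up data with a tie on the smallest volume (100 appears twice)
df = pd.DataFrame({
    'ticker': ['A', 'B', 'C', 'D', 'E'],
    'volume': [500, 100, 300, 100, 400],
})

# The three rows with the smallest volume
smallest = df.nsmallest(3, 'volume')

# Sorting the whole frame and taking the top three selects the same rows
via_sort = df.sort_values('volume').head(3)

# keep='all' returns every row tied for the cut-off, even if that exceeds n
with_ties = df.nsmallest(1, 'volume', keep='all')

print(smallest)
print(with_ties)
```

For a large DataFrame where only a few rows are needed, `nsmallest()` avoids sorting the full frame.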

***
# <font color=mediumblue>4.</font> **Reshaping Data:** Melting DataFrames <br><center>Long and Wide Formats
> <font color=deeppink>Right now, the data is somewhere between long and wide format.
> > <font color=deeppink>Use `melt()` to make it completely long format.

<font color=black><center>**HINT:** <font color=cherryblossom> &diams;&diams;`date` and `ticker` are our ID variables (they uniquely identify each row).</center>

> > <font color=deeppink>We need to melt the rest so that we don't have separate columns for `open`, `high`, `low`, `close`, and `volume`.

In [8]:
# Melting transforms data from wide to long format
## .melt() unpivots a DataFrame from wide to long (undoes a pivot)
melted_faang = faang.melt(
    # id_vars and value_vars arguments = lists
    id_vars=['ticker', 'date'],
    value_vars=['open', 'high', 'low', 'close', 'volume'])

# There will now be two new columns (`variable` and `value`) that will replace
## the `open`, `high`, `low`, `close`, and `volume` columns

# Inspect the first rows of the melted DataFrame
melted_faang.head(8)

| | ticker | date | variable | value |
|---|---|---|---|---|
| 0 | AAPL | 2018-01-02 | open | 42.540001 |
| 1 | AMZN | 2018-01-02 | open | 1172.0 |
| 2 | FB | 2018-01-02 | open | 177.679993 |
| 3 | GOOG | 2018-01-02 | open | 1048.339966 |
| 4 | NFLX | 2018-01-02 | open | 196.100006 |
| 5 | AAPL | 2018-01-03 | open | 43.1325 |
| 6 | AMZN | 2018-01-03 | open | 1188.300049 |
| 7 | FB | 2018-01-03 | open | 181.880005 |


> **NOTE:** The `.melt()` method reshapes a DataFrame so that one or more columns serve as identifier (ID) variables (`id_vars`), while all other columns, the measured variables (`value_vars`), are unpivoted to the row axis. This leaves just two (2) non-ID columns: `variable` and `value`.
> > Arguments of interest: <br> **id_vars**: *scalar, tuple, list, or ndarray* - optional <br> **value_vars**: *scalar, tuple, list, or ndarray* - optional

> **Note:** Melting DataFrames (Ch.3, pgs.169-172, Molin) <br>
> **Source:** Pandas `.melt()`: __[pydata.org](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)__
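
To make the wide-to-long reshaping concrete, here is a small sketch on two rows (the open and close values are copied from the 2018-01-02 rows shown earlier). The column names `measurement` and `price` are chosen only for illustration, via the optional `var_name` and `value_name` arguments. The final step pivots the long data back to wide form as a check that no information was lost.

```python
import pandas as pd

# Two wide rows: one column per measurement
wide = pd.DataFrame({
    'ticker': ['AAPL', 'FB'],
    'date': pd.to_datetime(['2018-01-02', '2018-01-02']),
    'open': [42.540001, 177.679993],
    'close': [43.064999, 181.419998],
})

# Melt: each (ticker, date, measurement) combination becomes its own row
long = wide.melt(
    id_vars=['ticker', 'date'],
    value_vars=['open', 'close'],
    var_name='measurement',   # replaces the default name `variable`
    value_name='price',       # replaces the default name `value`
)

# Pivot back to wide form to confirm the reshaping is reversible
restored = long.pivot(
    index=['ticker', 'date'], columns='measurement', values='price'
).reset_index()

print(long)
print(restored)
```

Two rows with two measured columns melt into four rows, one per ID and measurement pair.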

***
# <font color=mediumblue>5.</font> **Handling Data:** <br><center>Duplicate, Missing, or **Invalid** Data
> <font color=deeppink>Suppose we found out that on July 26, 2018, there was a glitch in how the data was recorded.
> > <font color=deeppink>**How should we handle this?**
<font color=black><center>**NOTE:** <font color=cherryblossom>There is no coding required for this exercise.</center>

In [9]:
print("There are several ways to handle invalid data, and the best option is the one " +
      "that has the least negative impact on the quality of the overall dataset. " +
      "It is important to remember that dropping even a few rows from a large dataset " +
      "can have a significant impact, depending on the context and data source. " +
      "The same is true of interpolating values in a few rows. " +
      "\nWe should consider the number of invalid values, their locations within the " +
      "dataset, and whether another source can supply the correct values. " +
      "If no such source is available, one reasonable option is to mark the affected " +
      "values as missing and fill them, for example with the `fillna()` method.")

There are several ways to handle invalid data, and the best option is the one that has the least negative impact on the quality of the overall dataset. It is important to remember that dropping even a few rows from a large dataset can have a significant impact, depending on the context and data source. The same is true of interpolating values in a few rows. 
We should consider the number of invalid values, their locations within the dataset, and whether another source can supply the correct values. If no such source is available, one reasonable option is to mark the affected values as missing and fill them, for example with the `fillna()` method.


> **NOTE:** Handling duplicate, missing, or invalid data (Ch.3, pgs.172-188, Molin) <br>Mitigating the Issues (pgs.179-188)
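
Although the exercise requires no code, two of the options discussed above can be sketched on made-up data. Everything here is hypothetical: the three-row `prices` frame, the value `999.0` standing in for a glitched reading, and the choice of linear interpolation. Which option is appropriate depends on what is known about the glitch.

```python
import numpy as np
import pandas as pd

# Made-up prices; 999.0 plays the role of the glitched value on 2018-07-26
prices = pd.DataFrame({
    'date': pd.to_datetime(['2018-07-25', '2018-07-26', '2018-07-27']),
    'close': [10.0, 999.0, 12.0],
})
glitch = prices['date'] == pd.Timestamp('2018-07-26')

# Option 1: drop the rows recorded on the affected date
dropped = prices[~glitch]

# Option 2: mark the affected values as missing, then fill them
masked = prices.copy()
masked.loc[glitch, 'close'] = np.nan
# Linear interpolation between neighbouring rows (an assumption, not the only choice)
filled = masked.assign(close=lambda x: x['close'].interpolate())

print(dropped)
print(filled)
```

Masking first matters because `fillna()` and `interpolate()` only act on values that are already missing; a glitched but non-null value would otherwise pass through unchanged.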