In [19]:
import pandas as pd
import numpy as np
import os

#### Merging 12 months of sales data into a single file

Below is how to read a single CSV file into the notebook using pandas:

In [3]:
df = pd.read_csv('./Sales_Data/Sales_April_2019.csv')

df.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,176558.0,USB-C Charging Cable,2.0,11.95,04/19/19 08:46,"917 1st St, Dallas, TX 75001"
1,,,,,,
2,176559.0,Bose SoundSport Headphones,1.0,99.99,04/07/19 22:30,"682 Chestnut St, Boston, MA 02215"
3,176560.0,Google Phone,1.0,600.0,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"
4,176560.0,Wired Headphones,1.0,11.99,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"


But what if you want to save time and read all of the CSV files into the notebook at once? Do you really need 12 separate pd.read_csv lines?

There is almost always an easier, more succinct way to perform a task. Don't be afraid to Google something to find a shorter, better way.

In [4]:
# list the folder contents, keeping only the .csv files
files = [file for file in os.listdir('./Sales_Data') if file.endswith('.csv')]

for file in files:
    print(file)

Sales_April_2019.csv
Sales_August_2019.csv
Sales_December_2019.csv
Sales_February_2019.csv
Sales_January_2019.csv
Sales_July_2019.csv
Sales_June_2019.csv
Sales_March_2019.csv
Sales_May_2019.csv
Sales_November_2019.csv
Sales_October_2019.csv
Sales_September_2019.csv


Now that we have all our files, we need to determine how to merge (or concatenate) them into a single CSV file.

In [8]:
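# read each monthly file into its own DataFrame, then concatenate them in a single call
# (this avoids growing a DataFrame inside the loop, which copies the data every iteration):
files = [file for file in os.listdir('./Sales_Data') if file.endswith('.csv')]
frames = [pd.read_csv(os.path.join('./Sales_Data', file)) for file in files]
all_months_data = pd.concat(frames)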

all_months_data.shape

(186850, 6)
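
The same merge can also be sketched with `glob`, which matches on the `.csv` extension so that stray files in the folder are skipped. The example below is a minimal, self-contained sketch rather than the notebook's own method: the helper name `load_all_csvs`, the file names, and the values are all made up for illustration, and the files are written to a temporary directory instead of `./Sales_Data`.

```python
import glob
import os
import tempfile

import pandas as pd


def load_all_csvs(folder):
    """Read every .csv file in `folder` and stack the rows into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, '*.csv')))
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)


# Hypothetical demo data: two small CSV files plus one non-CSV file.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'Order ID': [1, 2], 'Product': ['A', 'B']}).to_csv(
        os.path.join(tmp, 'Sales_January_2019.csv'), index=False)
    pd.DataFrame({'Order ID': [3], 'Product': ['C']}).to_csv(
        os.path.join(tmp, 'Sales_February_2019.csv'), index=False)
    with open(os.path.join(tmp, 'notes.txt'), 'w') as fh:
        fh.write('not a csv')

    combined = load_all_csvs(tmp)

print(combined.shape)  # → (3, 2)
```

Collecting the frames in a list and calling `pd.concat` once is generally cheaper than concatenating inside the loop, because each call to `pd.concat` copies all of the data gathered so far.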

In [9]:
all_months_data.to_csv('all_data.csv', index = False)

#### Read in the updated DataFrame

In [10]:
all_data = pd.read_csv('all_data.csv')
all_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,176558.0,USB-C Charging Cable,2.0,11.95,04/19/19 08:46,"917 1st St, Dallas, TX 75001"
1,,,,,,
2,176559.0,Bose SoundSport Headphones,1.0,99.99,04/07/19 22:30,"682 Chestnut St, Boston, MA 02215"
3,176560.0,Google Phone,1.0,600.0,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"
4,176560.0,Wired Headphones,1.0,11.99,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"


##### **QUESTION 1: What was the best month for sales? How much was earned in that month?**

In [11]:
all_data.columns

Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

To find the total money earned for each month, we must calculate the total income received for each sale, then add it up for each month.

We can do this as follows:
- earned = price each * quantity of item [Done]
- sum of earned for each month
- max(sum_earned) = best month

Problems encountered:
- Some rows contain nothing but NaN values.
- Some rows are just the column names repeated for readability. Arithmetic cannot be performed on these rows, which results in errors.
- The numbers shown in the columns are actually stored as strings, not floats or ints.
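
Each of these three problems can be checked for directly before any cleaning is done. The following is a small sketch on made-up rows that imitate the issues listed above; the `sample` DataFrame is hypothetical and is not taken from the sales files.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one valid sale, one blank row, one repeated header row, one valid sale.
sample = pd.DataFrame({
    'Quantity Ordered': ['2', np.nan, 'Quantity Ordered', '1'],
    'Price Each': ['11.95', np.nan, 'Price Each', '99.99'],
})

# Problem 1: rows where every column is NaN
blank_rows = sample.isna().all(axis=1).sum()

# Problem 2: rows that repeat the column name instead of holding a value
header_rows = (sample['Quantity Ordered'] == 'Quantity Ordered').sum()

# Problem 3: the columns hold text, so they are not a numeric dtype
is_numeric = pd.api.types.is_numeric_dtype(sample['Quantity Ordered'])

print(blank_rows, header_rows, is_numeric)
```

Running the same three checks on the real data gives a count of how many rows each cleaning step should remove.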

In [229]:
# select only the columns needed for this question;
# .copy() makes df independent of all_data so later edits don't affect the original
df = all_data[['Quantity Ordered', 'Price Each', 'Order Date']].copy()

In [230]:
df.head()

Unnamed: 0,Quantity Ordered,Price Each,Order Date
0,2.0,11.95,04/19/19 08:46
1,,,
2,1.0,99.99,04/07/19 22:30
3,1.0,600.0,04/12/19 14:38
4,1.0,11.99,04/12/19 14:38


Next, we must drop the rows that only contain the column heading repeated for readability:

In [232]:
# collect the index labels of the repeated header rows, then drop them all at once
header_rows = df[df['Quantity Ordered'] == 'Quantity Ordered'].index
df.drop(header_rows, inplace=True)

The above removes every row that merely repeats the column headers. With those rows gone, the remaining values can be converted to numbers and multiplied.

In [233]:
df.head()

Unnamed: 0,Quantity Ordered,Price Each,Order Date
0,2.0,11.95,04/19/19 08:46
1,,,
2,1.0,99.99,04/07/19 22:30
3,1.0,600.0,04/12/19 14:38
4,1.0,11.99,04/12/19 14:38


Since the rows that show up as NaN are blank across every column of the data set, we can drop them entirely:

In [234]:
df.dropna(inplace=True)

Now that 'Quantity Ordered' contains only numeric strings, we must convert them from strings to numbers so that we can perform arithmetic:

In [235]:
df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'])

In [236]:
df.head()

Unnamed: 0,Quantity Ordered,Price Each,Order Date
0,2,11.95,04/19/19 08:46
2,1,99.99,04/07/19 22:30
3,1,600.0,04/12/19 14:38
4,1,11.99,04/12/19 14:38
5,1,11.99,04/30/19 09:27


We must now do the same for the 'Price Each' Series. Because .drop(inplace=True) removed the header rows from the whole DataFrame rather than from a single column, we don't have to repeat that step for this Series.

In [237]:
df['Price Each'] = pd.to_numeric(df['Price Each'])
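
As an alternative sketch (not the approach taken in this notebook), `pd.to_numeric` accepts `errors='coerce'`, which turns anything it cannot parse, including repeated header strings, into NaN. The header rows and the blank rows can then be removed together with a single `dropna`. The `sample` values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: valid sale, blank row, repeated header row, valid sale.
sample = pd.DataFrame({
    'Quantity Ordered': ['2', np.nan, 'Quantity Ordered', '1'],
    'Price Each': ['11.95', np.nan, 'Price Each', '99.99'],
})

# Unparseable entries (the header strings) become NaN instead of raising an error.
for col in ['Quantity Ordered', 'Price Each']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce')

# One dropna now removes both the blank rows and the former header rows.
cleaned = sample.dropna().reset_index(drop=True)
cleaned['Earned'] = cleaned['Quantity Ordered'] * cleaned['Price Each']

print(cleaned)
```

The trade-off is that coercion silently discards any malformed value, so it is worth comparing row counts before and after to confirm that only the expected rows were lost.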

In [238]:
df['Earned'] = df['Quantity Ordered'] * df['Price Each']

In [239]:
df.head()

Unnamed: 0,Quantity Ordered,Price Each,Order Date,Earned
0,2,11.95,04/19/19 08:46,23.9
2,1,99.99,04/07/19 22:30,99.99
3,1,600.0,04/12/19 14:38,600.0
4,1,11.99,04/12/19 14:38,11.99
5,1,11.99,04/30/19 09:27,11.99


To make the DataFrame easier to navigate, we reset its index so that it no longer skips values (0, 1, 2... instead of 0, 2, 3...).

In [266]:
df = df.reset_index(drop=True)

Now that we have the total amount earned for each sale, we must determine how much was earned per month. To do this, we first identify the month in which each purchase occurred, then sum all values in the 'Earned' column for that month.

Because every order date starts with a two-digit month, we can slice the first two characters of each string with the .str accessor and store them in a new Month column.

In [279]:
df['Month'] = df['Order Date'].str[:2]

In [289]:
df['Month'].unique() #ensure it's only the months and that all months are represented

array(['04', '05', '08', '09', '12', '01', '02', '03', '07', '06', '11',
       '10'], dtype=object)
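
Taking the first two characters works only because every date here begins with a two-digit month. A sketch of a more robust alternative, assuming the `MM/DD/YY HH:MM` layout seen in the 'Order Date' column, is to parse the strings with `pd.to_datetime` and read the month through the `.dt` accessor. The three sample dates are copied from the rows shown earlier; the `orders` DataFrame itself is hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    'Order Date': ['04/19/19 08:46', '04/07/19 22:30', '04/30/19 09:27'],
})

# Parse with an explicit format so malformed dates raise an error instead of being misread.
parsed = pd.to_datetime(orders['Order Date'], format='%m/%d/%y %H:%M')

# .dt.month yields integers (4), not zero-padded strings ('04').
orders['Month'] = parsed.dt.month

print(orders['Month'].tolist())  # → [4, 4, 4]
```

Note that this produces integer months, so later comparisons would use `4` rather than `'04'`.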

In [318]:
# total earned across the whole year, followed by the total earned in April ('04')
print(sum(df['Earned']))
sum(df['Earned'][df['Month'] == '04'])

34492035.969934314


3390670.240000704
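
Rather than filtering one month at a time, the totals for every month can be computed in one pass with `groupby`. The following is a minimal sketch on made-up numbers (not the real sales figures), assuming 'Month' and 'Earned' columns like the ones built above; `idxmax` then returns the month with the largest total.

```python
import pandas as pd

# Hypothetical data: five sales spread over three months.
toy = pd.DataFrame({
    'Month': ['04', '04', '05', '12', '12'],
    'Earned': [23.90, 99.99, 11.99, 600.00, 11.99],
})

# Sum the 'Earned' column within each month.
monthly = toy.groupby('Month')['Earned'].sum()

# The index label of the largest total is the best month.
best_month = monthly.idxmax()

print(monthly)
print(best_month, monthly.max())
```

Applied to the cleaned `df`, the same two calls would answer both parts of Question 1: which month was best and how much was earned in it.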