# Coffee Sales Analysis

![coffee_image](../assets/coffee-image.jpeg "alt-coffee_image")

## About Author

Author: Joshua Farara

Project: title

### Contact Info
Use the links below to contact, follow, or correct me:

Email: joshua.farara@gmail.com

[LinkedIn](https://www.linkedin.com/in/joshuafarara/)

[Facebook](https://www.facebook.com/josh.farara/)

[Twitter](https://x.com/FararaTheArtist)

[GitHub](https://github.com/JoshuaFarara)


## Import Libraries

We will use the following libraries:
1. pandas: Data manipulation and analysis
2. NumPy: Numerical operations and calculations
3. Matplotlib: Data visualization and plotting
4. Seaborn: Enhanced data visualization and statistical graphics
5. SciPy: Scientific computing and advanced mathematical operations

In [69]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Data Loading and Exploration | Cleaning

### Load the CSV file into a DataFrame

In [70]:
# Kaggle Notebook
# df = pd.read_csv('/kaggle/input/coffee-sales/index.csv')


#Local Machine Notebook
df = pd.read_csv('data/coffee_sales_data.csv')
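
Because the file path differs between Kaggle and a local machine, one option is to try each candidate path in turn instead of commenting lines in and out. The following is a minimal sketch, assuming the two paths shown in the cell above; the helper name `load_first_existing` is hypothetical and not part of pandas.

```python
from pathlib import Path

import pandas as pd

# Candidate locations taken from the cell above (Kaggle first, then local).
CANDIDATE_PATHS = [
    "/kaggle/input/coffee-sales/index.csv",
    "data/coffee_sales_data.csv",
]


def load_first_existing(paths):
    """Read the first path that exists into a DataFrame (hypothetical helper)."""
    for path in paths:
        if Path(path).is_file():
            return pd.read_csv(path)
    raise FileNotFoundError(f"None of these files exist: {list(paths)}")


# Usage in the notebook would look like:
# df = load_first_existing(CANDIDATE_PATHS)
```

Raising `FileNotFoundError` when no candidate exists keeps a wrong path from going unnoticed.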

### Set display options to show all columns and rows

In [71]:
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', None)
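
Setting `display.max_rows` to `None` globally means every later output prints all of its rows. If that is only wanted for a single cell, `pd.option_context` can scope the setting to a `with` block. A small sketch on a toy frame (the `demo` data is illustrative, not from the coffee dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
demo = pd.DataFrame({"value": range(100)})

# With a row limit, pandas truncates the printed frame.
with pd.option_context("display.max_rows", 10):
    short_text = repr(demo)

# With the limit removed, all 100 rows are rendered, but only inside this block.
with pd.option_context("display.max_rows", None):
    full_text = repr(demo)

print(len(short_text.splitlines()), "lines vs", len(full_text.splitlines()), "lines")
```

Once each `with` block exits, the option returns to whatever value it had before.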

### Get a sneak peek of the data
A sneak peek gives a quick overview of the data and helps identify potential problems or areas of interest.

In [72]:
df.head(5)

|   | date | datetime | cash_type | card | money | coffee_name |
|---|------|----------|-----------|------|-------|-------------|
| 0 | 2024-03-01 | 2024-03-01 10:15:50.520 | card | ANON-0000-0000-0001 | 38.7 | Latte |
| 1 | 2024-03-01 | 2024-03-01 12:19:22.539 | card | ANON-0000-0000-0002 | 38.7 | Hot Chocolate |
| 2 | 2024-03-01 | 2024-03-01 12:20:18.089 | card | ANON-0000-0000-0002 | 38.7 | Hot Chocolate |
| 3 | 2024-03-01 | 2024-03-01 13:46:33.006 | card | ANON-0000-0000-0003 | 28.9 | Americano |
| 4 | 2024-03-01 | 2024-03-01 13:48:14.626 | card | ANON-0000-0000-0004 | 38.7 | Latte |


### Let's see the column names

In [73]:
df.columns

Index(['date', 'datetime', 'cash_type', 'card', 'money', 'coffee_name'], dtype='object')

### Let's look at the shape of the dataset

In [74]:
print(f"The Number of Rows are {df.shape[0]}, and columns are {df.shape[1]}.")

The Number of Rows are 896, and columns are 6.


### Let's look at the columns and their data types using `df.info()`

In [75]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 896 entries, 0 to 895
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         896 non-null    object 
 1   datetime     896 non-null    object 
 2   cash_type    896 non-null    object 
 3   card         807 non-null    object 
 4   money        896 non-null    float64
 5   coffee_name  896 non-null    object 
dtypes: float64(1), object(5)
memory usage: 42.1+ KB


### Count the missing values

In [76]:
df.isnull().sum()

date            0
datetime        0
cash_type       0
card           89
money           0
coffee_name     0
dtype: int64
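
All of the missing values sit in the `card` column. A natural follow-up is to check how they are distributed across `cash_type`. The sketch below shows one way to do that on a few illustrative rows; the `"cash"` label and the sample values are assumptions for demonstration, and the result on the real data is not claimed here.

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the loaded `df`.
sample = pd.DataFrame({
    "cash_type": ["card", "card", "cash", "cash"],  # "cash" label is assumed
    "card": ["ANON-0000-0000-0001", "ANON-0000-0000-0002", None, None],
})

# Flag missing card identifiers, then count them within each payment type.
missing_by_type = sample["card"].isna().groupby(sample["cash_type"]).sum()
print(missing_by_type)
```

If the real data shows the same pattern, the missing values would simply reflect payments made without a card and would not need imputation.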

## Observation Set 1

* There are 896 rows and 6 columns in the dataset.

* Every column has dtype `object` except `money`, which is `float64`.

* The columns in the dataset are:
    * 'date', 'datetime', 'cash_type', 'card', 'money', 'coffee_name'

* The `card` column has 89 missing values (807 of 896 entries are non-null); no other column has any. We will examine and deal with them later in the notebook.

* Rename columns: 'cash_type' to 'payment_type', 'card' to 'card_number', 'money' to 'amount_paid_usd'.

* `datetime` can be split into a date part and a time part. The date part duplicates the existing `date` column, so only the time part needs to be kept.

* The `money` column may need to be rescaled by moving the decimal point one place to the left (for example, 38.7 would become 3.87).
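
Since `date` and `datetime` are stored as `object` (plain strings), one possible follow-up is to parse them into real timestamps with `pd.to_datetime`. This is a sketch of that option on two rows copied from the preview above, not a step the notebook performs:

```python
import pandas as pd

# Two rows copied from the df.head() preview.
sample = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-01"],
    "datetime": ["2024-03-01 10:15:50.520", "2024-03-01 12:19:22.539"],
    "money": [38.7, 38.7],
})

# Parse the text columns into datetime64 values.
sample["date"] = pd.to_datetime(sample["date"])
sample["datetime"] = pd.to_datetime(sample["datetime"])

print(sample.dtypes)
print(sample["datetime"].dt.hour.tolist())
```

With a datetime dtype, components such as the hour become available through the `.dt` accessor, which is convenient for time-of-day analysis.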




### Task:

Clean the data by renaming columns and splitting the timestamp.

1. Change column names to appropriate names matching the data.
2. Split `datetime` so that the time of day is kept in its own column.

#### Changing column names 

Changed `cash_type` to `payment_type`, since both cash and card payments are accepted.

In [77]:
df.rename(columns={'cash_type':'payment_type', 'card':'card_number', 'money':'amount_paid_usd'}, inplace=True)
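
By default, `rename` silently ignores keys that do not match an existing column, so a typo in the mapping goes unnoticed. Passing `errors="raise"` makes pandas fail instead. A minimal sketch on an empty frame that carries the dataset's original column names:

```python
import pandas as pd

# Empty frame with the original column names from the dataset.
sample = pd.DataFrame(
    columns=["date", "datetime", "cash_type", "card", "money", "coffee_name"]
)

new_names = {
    "cash_type": "payment_type",
    "card": "card_number",
    "money": "amount_paid_usd",
}

# errors="raise" triggers a KeyError if any key is not an existing column.
renamed = sample.rename(columns=new_names, errors="raise")
print(list(renamed.columns))
```

Assigning the result to a new variable instead of using `inplace=True` also leaves the original frame untouched, which makes the cell safe to re-run.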

In [78]:
df.columns

Index(['date', 'datetime', 'payment_type', 'card_number', 'amount_paid_usd',
       'coffee_name'],
      dtype='object')

#### Splitting datetime into a time column

In [79]:
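# Split 'datetime' on the first space into a date part and a time part
df[['new_date', 'time']] = df['datetime'].str.split(' ', n=1, expand=True)
# Drop the date part (it duplicates 'date') and the original 'datetime' column
df = df.drop(['new_date', 'datetime'], axis=1)
# Optional: reorder so that 'time' sits next to 'date'
# df = df[['date', 'time', 'payment_type', 'card_number', 'amount_paid_usd', 'coffee_name']]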
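
The cell above discards the date half of `datetime` on the assumption that it repeats the `date` column. That assumption can be checked before anything is dropped. A sketch on two rows copied from the preview, with the drop made conditional on the check:

```python
import pandas as pd

# Two rows copied from the df.head() preview.
sample = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-01"],
    "datetime": ["2024-03-01 10:15:50.520", "2024-03-01 12:19:22.539"],
})

# Split once on the first space: column 0 is the date, column 1 is the time.
parts = sample["datetime"].str.split(" ", n=1, expand=True)

# Only discard the date part if it matches the existing 'date' column everywhere.
dates_match = bool((parts[0] == sample["date"]).all())
if dates_match:
    sample["time"] = parts[1]
    sample = sample.drop(columns=["datetime"])

print(dates_match)
print(sample)
```

If the check fails on the real data, the mismatching rows are worth inspecting before `datetime` is removed.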

In [80]:
# Check the data
df.head()

In [81]:
df['time']

0      10:15:50.520
1      12:19:22.539
2      12:20:18.089
3      13:46:33.006
4      13:48:14.626
5      15:39:47.726
6      16:19:02.756
7      18:39:03.580
8      19:22:01.762
9      19:23:15.887
10     19:29:17.391
11     10:22:06.957
12     10:30:35.668
13     10:41:41.249
14     11:59:45.484
15     14:38:35.535
16     16:37:24.475
17     17:34:54.969
18     10:10:43.981
19     10:27:18.561
20     11:33:56.118
21     12:26:56.098
22     13:09:36.321
23     17:06:40.271
24     17:08:45.895
25     18:03:23.369
26     18:04:27.946
27     18:08:04.959
28     10:03:51.994
29     10:54:50.958
30     11:05:16.184
31     14:04:37.734
32     09:59:52.651
33     14:34:55.963
34     17:34:06.043
35     17:35:25.021
36     17:36:28.571
37     17:37:13.659
38     17:38:09.354
39     17:56:15.776
40     18:01:31.242
41     12:30:27.089
42     13:24:07.667
43     13:25:14.351
44     14:52:01.761
45     14:53:18.344
46     10:08:58.945
47     10:18:40.543
48     11:03:58.976
49     11:25:43.977


###