# Programming in Data Science Project: Invoices Dataset
**Students: Leo WINTER, Yoann SUBLET, Kellian VERVAELE KLEIN, Alvaro SERERO**

Dataset: Invoices
Kaggle link: https://www.kaggle.com/datasets/cankatsrc/invoices/data

This dataset includes customer details (first name, last name, email), transaction information (product ID, quantity, amount, invoice date), and additional attributes such as address, city, stock code, and job.

## 1) Data collection and exploration

In [6]:
# Load the data into a pandas DataFrame
import pandas as pd

def load_data(file_path: str) -> pd.DataFrame:
    df = pd.read_csv(file_path)
    return df

def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    if 'invoice_date' in df.columns:
        # Convert 'invoice_date' to datetime format
        df['invoice_date'] = pd.to_datetime(df['invoice_date'], format='%d/%m/%Y', errors='coerce')

    if 'qty' in df.columns and 'amount' in df.columns:
        # Create 'revenue' column as the product of 'qty' and 'amount'
        df['revenue'] = df['qty'] * df['amount']
    
    return df

def explore_data(df: pd.DataFrame) -> pd.DataFrame:
    print("Shape (rows, columns):", df.shape)
    print("\nColumn dtypes:")
    print(df.dtypes)

    print("\nMissing values per column:")
    print(df.isna().sum())

    print("\nBasic description of numerical columns:")
    print(df.describe())

    # Correlation matrix for numeric variables
    print("\nCorrelation matrix (numeric columns):")
    print(df[['qty', 'amount', 'revenue']].corr())
    return df

df = load_data('invoices.csv')
df = preprocess_data(df)
df = explore_data(df)

Shape (rows, columns): (10000, 12)

Column dtypes:
first_name              object
last_name               object
email                   object
product_id               int64
qty                      int64
amount                 float64
invoice_date    datetime64[ns]
address                 object
city                    object
stock_code               int64
job                     object
revenue                float64
dtype: object

Missing values per column:
first_name      0
last_name       0
email           0
product_id      0
qty             0
amount          0
invoice_date    0
address         0
city            0
stock_code      0
job             0
revenue         0
dtype: int64

Basic description of numerical columns:
         product_id           qty        amount                invoice_date  \
count  10000.000000  10000.000000  10000.000000                       10000   
mean     149.746700      5.005900     52.918236  1995-06-11 13:32:18.240000   
min      100.000000      1.0

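The missing-value counts above show that there is no missing or NaN data: every column has 0 missing values across the 10000 rows. This also means that every `invoice_date` was parsed successfully, since `errors='coerce'` would have turned any unparseable date into `NaT`.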
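To make this check reusable, the sketch below summarizes missing values per column as both a count and a percentage. The helper name `missing_report` and the toy data are assumptions made for illustration; they are not part of the project code or of the invoices file.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: missing-value count and percentage per column."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Columns with the most missing values come first
    return report.sort_values("missing", ascending=False)

# Toy data (not taken from invoices.csv): the second date does not match
# the day/month/year format, so errors='coerce' turns it into NaT.
toy = pd.DataFrame({
    "qty": [1, 2, 3],
    "invoice_date": pd.to_datetime(
        ["25/12/2020", "2020-13-45", "01/01/2021"],
        format="%d/%m/%Y",
        errors="coerce",
    ),
})
report = missing_report(toy)
print(report)
```

On the real dataset, `missing_report(df)` should report 0 for every column if the counts above hold.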

## 2) Querying the dataset

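Total revenue per city. The function groups the invoices by city and sums their revenue. Because the result is not sorted by revenue, `head(n)` returns the first `n` cities in alphabetical order, not the `n` cities with the highest revenue.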

In [7]:
def indicator_top_cities(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    top_revenue_cities = df.groupby('city', as_index=False)['revenue'].sum().head(n)
    return top_revenue_cities

indicator_top_cities(df, 5)

| | city | revenue |
|---|---|---|
| 0 | Aaronburgh | 555.54 |
| 1 | Aaronfurt | 176.13 |
| 2 | Aaronland | 65.56 |
| 3 | Aaronport | 400.55 |
| 4 | Aaronshire | 732.24 |
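The cities in this output appear in alphabetical order because `groupby` sorts its groups by key. If the goal is to rank cities by revenue, one possible variant uses `DataFrame.nlargest`. This is a minimal sketch: the function name `top_cities_by_revenue` and the toy data are assumptions for illustration, not the project's code or data.

```python
import pandas as pd

def top_cities_by_revenue(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Hypothetical variant: the n cities with the highest total revenue."""
    totals = df.groupby("city", as_index=False)["revenue"].sum()
    # nlargest sorts by revenue in descending order and keeps the first n rows
    return totals.nlargest(n, "revenue").reset_index(drop=True)

# Toy data (not taken from invoices.csv)
toy = pd.DataFrame({
    "city": ["A", "B", "A", "C"],
    "revenue": [10.0, 50.0, 5.0, 30.0],
})
top2 = top_cities_by_revenue(toy, 2)
print(top2)
```

Here city `A` totals 15.0, so the two rows returned are `B` (50.0) and `C` (30.0).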


In [8]:
def monthly_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Get monthly revenue."""
    monthly_revenue_df = df.set_index('invoice_date').resample('M')['revenue'].sum().to_frame('monthly_revenue')
    return monthly_revenue_df

monthly_revenue(df)



| invoice_date | monthly_revenue |
|---|---|
| 1970-01-31 | 3878.93 |
| 1970-02-28 | 6050.15 |
| 1970-03-31 | 5527.01 |
| 1970-04-30 | 2050.53 |
| 1970-05-31 | 2779.05 |
| ... | ... |
| 2021-09-30 | 5605.31 |
| 2021-10-31 | 3333.89 |
| 2021-11-30 | 5293.00 |
| 2021-12-31 | 3466.81 |
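pandas 2.2 and later deprecate the `'M'` alias in `resample` in favor of `'ME'`, so the call above may emit a `FutureWarning` depending on the installed version. An alternative that does not depend on the offset alias is to group by calendar-month periods, where `'M'` remains valid. The sketch below assumes a hypothetical function name `monthly_revenue_by_period` and toy data. Unlike `resample`, it does not create rows for months without invoices, and its index holds periods such as `2021-01` rather than month-end timestamps.

```python
import pandas as pd

def monthly_revenue_by_period(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant: sum revenue per calendar month using periods."""
    # Map each invoice date to its calendar month, e.g. 2021-01
    months = df["invoice_date"].dt.to_period("M")
    out = df.groupby(months)["revenue"].sum().to_frame("monthly_revenue")
    out.index.name = "month"
    return out

# Toy data (not taken from invoices.csv); February has no invoices
toy = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-03-02"]),
    "revenue": [10.0, 5.0, 7.0],
})
monthly = monthly_revenue_by_period(toy)
print(monthly)
```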


In [None]:
# Main block
if __name__ == "__main__":
    # Step 1: Data collection
    filepath = 'invoices.csv'
    df = load_data(filepath)
    df = preprocess_data(df)
    df = explore_data(df)
    # Step 2: Querying the dataset
    print(indicator_top_cities(df, 5))
    print(monthly_revenue(df))
