The data cleaning steps are based on the following assessment points:

The dataset covers almost three years of daily sales, from January 2013 to October 2015. Each data point represents the sale of an item in a shop on a particular date. There are 2,935,849 observations across 6 features.

There are 21,807 items sold in 60 shops. The items are divided into 84 item categories.

Sales are not uniform across shops. Shops with ids from 20 to 35 sell very well, while shops with ids below 20 sell less than shops with ids above 35.

Looking at the sales per date block, we can see that sales decreased from 2013 to 2015, with a couple of spikes in between. The year 2013 begins with 200K+ items sold; by the end of 2015 this has almost halved, to 100K+ items sold.

The requirement is to predict monthly sales of each item in each shop. 
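
Since the raw data is daily and the target is monthly, the daily rows have to be summed per month, shop and item at some point. The following is only a minimal sketch of that aggregation on a toy frame. The column names `date_block_num`, `shop_id`, `item_id` and `item_cnt_day` are assumed to match the competition files, and `monthly_sales` and `item_cnt_month` are hypothetical names used for illustration.

```python
import pandas as pd

# Toy stand-in for sales_train.csv (column names are assumptions)
sales = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1],
    "shop_id":        [5, 5, 7, 5, 5],
    "item_id":        [10, 10, 10, 10, 11],
    "item_cnt_day":   [1.0, 2.0, 1.0, 4.0, 1.0],
})


def monthly_sales(daily: pd.DataFrame) -> pd.DataFrame:
    """Sum daily counts into one row per (month block, shop, item)."""
    keys = ["date_block_num", "shop_id", "item_id"]
    return (
        daily.groupby(keys, as_index=False)["item_cnt_day"]
             .sum()
             .rename(columns={"item_cnt_day": "item_cnt_month"})
    )


monthly = monthly_sales(sales)
print(monthly)
```

Shop/item pairs with no sales in a month simply do not appear in this output; whether to fill them with zeros is a separate modelling decision.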


During the first-level assessment, we identified a few quality issues that need to be rectified. We also have some transformation ideas that will derive more information from the dataset.

Training Set

Quality Issues

1. Round all float values in the dataframe to two decimal places
2. Remove the row with item_id = 0 
3. Remove 6 duplicate rows
4. Fix or remove the row at index 484683, where item_price is negative
5. Downcast dataframe to save memory
6. Remove rows where items have been returned (7356 rows). These are rows with negative item_cnt_day values
7. Remove rows where item_cnt_day > 100 (about 138 rows)
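
The row-level steps in this list could be sketched roughly as follows. This is a toy illustration, not the notebook's actual cleaning code: `clean_sales` is a hypothetical helper, the negative-price row is simply dropped here (the list does not say how it should be handled), and downcasting is left out.

```python
import pandas as pd

# Toy frame that contains one example of each issue
raw = pd.DataFrame({
    "item_id":      [0,    1,      1,      2,    3,    4,     5],
    "item_price":   [10.0, 99.999, 99.999, -1.0, 5.0,  20.0,  7.5],
    "item_cnt_day": [1.0,  2.0,    2.0,    1.0,  -1.0, 150.0, 3.0],
})


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the row-level quality fixes in the order listed above."""
    out = df.round(2)                        # 1. round floats to two decimals
    out = out[out["item_id"] != 0]           # 2. drop item_id = 0
    out = out.drop_duplicates()              # 3. drop duplicate rows
    out = out[out["item_price"] >= 0]        # 4. drop negative prices (assumption)
    out = out[out["item_cnt_day"] >= 0]      # 6. drop returns
    out = out[out["item_cnt_day"] <= 100]    # 7. drop very large daily counts
    return out.reset_index(drop=True)


cleaned = clean_sales(raw)
print(cleaned)
```

Only the two ordinary rows survive in this toy example, which makes it easy to check each filter in isolation.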

Tidiness Issues

1. Combine sales_train and item_categories with item_id
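
A possible shape for this merge is sketched below on toy frames. It assumes that `item_categories` is keyed by `item_category_id` and that `items.csv` provides the `item_id` to `item_category_id` mapping, so the join goes through `items`; these column names and the category labels are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the three files (schemas are assumptions)
sales = pd.DataFrame({"item_id": [10, 11, 10], "item_cnt_day": [1.0, 2.0, 1.0]})
items = pd.DataFrame({"item_id": [10, 11], "item_category_id": [40, 55]})
item_categories = pd.DataFrame({
    "item_category_id": [40, 55],
    "item_category_name": ["category A", "category B"],
})

# Left joins keep every sales row; validate raises if a lookup key is not unique
merged = (
    sales.merge(items, on="item_id", how="left", validate="many_to_one")
         .merge(item_categories, on="item_category_id", how="left",
                validate="many_to_one")
)
print(merged)
```

Using `how="left"` means an unmatched `item_id` shows up as a NaN category rather than silently disappearing, which is easier to audit afterwards.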

Add New Features

1. Item_price_class - Divide item prices into 4 classes
2. Split the date column into day, month, year, day of the week, and weekday/weekend
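
Both feature ideas could look roughly like this. The sketch assumes dates are stored as `dd.mm.yyyy` strings and uses quartiles (`pd.qcut`) for the four price classes; the list above does not specify how the classes are cut, so equal-width bins via `pd.cut` would be an equally valid reading. `add_features` and the new column names are hypothetical.

```python
import pandas as pd

# Toy frame; the date format is an assumption about sales_train.csv
sales = pd.DataFrame({
    "date": ["02.01.2013", "05.01.2013", "14.02.2014", "31.10.2015"],
    "item_price": [99.0, 399.0, 899.0, 2599.0],
})


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 4-level price class and calendar features derived from `date`."""
    out = df.copy()
    # Quartile-based classes labelled 0 (cheapest) to 3 (most expensive)
    out["item_price_class"] = pd.qcut(out["item_price"], q=4, labels=False)
    parsed = pd.to_datetime(out["date"], format="%d.%m.%Y")
    out["day"] = parsed.dt.day
    out["month"] = parsed.dt.month
    out["year"] = parsed.dt.year
    out["day_of_week"] = parsed.dt.dayofweek   # Monday = 0 ... Sunday = 6
    out["is_weekend"] = out["day_of_week"] >= 5
    return out


features = add_features(sales)
print(features)
```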



## 1. Set up

### 1. Drive

1 - Mount Drive

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


2 - Move to the data folder

In [175]:
cd "gdrive/MyDrive/Projects/1 - Numericals/Predict Future Sales/2 - Production/data"

[Errno 2] No such file or directory: 'gdrive/MyDrive/Projects/1 - Numericals/Predict Future Sales/2 - Production/data'
/content/gdrive/MyDrive/Projects/1 - Numericals/Predict Future Sales/2 - Production/data


### 2. Libraries

In [176]:
# Load data
import pandas as pd
import numpy as np
import io
import os
import glob

# Meta
import time

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sb

# Analysis
from scipy.stats import zscore

### 3. Data

1 - List file names

In [177]:
ls

competitive-data-science-predict-future-sales.zip  sales_train.csv
item_categories.csv                                sample_submission.csv
items.csv                                          shops.csv
kaggle.json                                        test.csv


In [178]:
# Load sales_train
df = pd.read_csv('sales_train.csv')
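
With close to three million rows loaded, the downcasting step from the quality-issue list can reduce memory noticeably. The helper below is a hedged sketch of one way to do it with `pd.to_numeric`; `downcast_numeric` is a hypothetical name and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype that fits."""
    out = frame.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame standing in for the loaded sales data
toy = pd.DataFrame({
    "shop_id": [0, 59],
    "item_id": [22, 20000],
    "item_price": [999.0, 1.5],
})
small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Note that `float32` keeps only about seven significant digits, so it is worth checking that prices still compare as expected after the conversion.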

### 4. Classes

1 - Define the class

A - Data Assessment of one class

In [179]:
# Class that helps the assessment of each table individually
class DataAssessment:

  def __init__(self, df):
    self.df = df


  # Get the basic file information
  def files_basic_info(self):
    '''
    Function - Get the very basic details about the data files
    Input - None
    Action - Find the number of csv files, file size,
            number of rows and number of columns
    Dependencies - 
      import glob
      import os
    '''
    begin = time.time()
    
    # Get all the files in the folder
    files = os.listdir()
    file_list = glob.glob('*.csv')

    # Find the number of files in the folder
    print('The number of files in the directory is:',len([name for name in os.listdir('.') if os.path.isfile(name)]))
    print('\n')
    print('The file names are:')
    print(files)
    print('\n')

    # Get the details of the csv files
    print('The csv files details are:')
    print('\n')
    for i in file_list:
      print('File:',i)
      file_size = os.path.getsize(i)
      converted_size = self.formatFileSize(file_size, 'B', 'MB', precision=0)
      # error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
      df = pd.read_csv(i, on_bad_lines='skip')
      df_shape = df.shape
      n_rows = df_shape[0]
      n_columns = df_shape[1]
      file_size = self.formatFileSize(file_size, 'B', 'MB', precision=2)
      head = df.head()
      print('File size:',file_size,'MB')
      print('Number of data points:',n_rows)
      print('Number of features:',n_columns)
      print('\n')
    end = time.time()
    self.find_time_taken(begin, end)

  
  # Code attribution : https://www.codegrepper.com/code-examples/python/convert+bytes+to+mb+python
  def convertFloatToDecimal(self, f=0.0, precision=2):
      '''
      Function: Convert a float to string of decimal.
      precision: by default 2.
      If no arg provided, return "0.00".
      '''
      return ("%." + str(precision) + "f") % f


  # Code attribution : https://www.codegrepper.com/code-examples/python/convert+bytes+to+mb+python
  def formatFileSize(self, size, sizeIn, sizeOut, precision=0):
    '''
    Function: Convert file size to a string representing its value in B, KB, MB and GB.
    The convention is based on sizeIn as original unit and sizeOut
    as final unit. 
    '''
    assert sizeIn.upper() in {"B", "KB", "MB", "GB"}, "sizeIn type error"
    assert sizeOut.upper() in {"B", "KB", "MB", "GB"}, "sizeOut type error"
    if sizeIn == "B":
        if sizeOut == "KB":
            return self.convertFloatToDecimal((size/1024.0), precision)
        elif sizeOut == "MB":
            return self.convertFloatToDecimal((size/1024.0**2), precision)
        elif sizeOut == "GB":
            return self.convertFloatToDecimal((size/1024.0**3), precision)
    elif sizeIn == "KB":
        if sizeOut == "B":
            return self.convertFloatToDecimal((size*1024.0), precision)
        elif sizeOut == "MB":
            return self.convertFloatToDecimal((size/1024.0), precision)
        elif sizeOut == "GB":
            return self.convertFloatToDecimal((size/1024.0**2), precision)
    elif sizeIn == "MB":
        if sizeOut == "B":
            return self.convertFloatToDecimal((size*1024.0**2), precision)
        elif sizeOut == "KB":
            return self.convertFloatToDecimal((size*1024.0), precision)
        elif sizeOut == "GB":
            return self.convertFloatToDecimal((size/1024.0), precision)
    elif sizeIn == "GB":
        if sizeOut == "B":
            return self.convertFloatToDecimal((size*1024.0**3), precision)
        elif sizeOut == "KB":
            return self.convertFloatToDecimal((size*1024.0**2), precision)
        elif sizeOut == "MB":
            return self.convertFloatToDecimal((size*1024.0), precision)


  # Show the first 10 rows of the dataframe
  def print_head(self, df):
    '''
    Function: Print the first 10 rows of the dataframe 
    Input: A dataframe
    Output: The first 10 rows of the dataframe
    '''
    return df.head(n=10)
  
  # Show a single datapoint vertically
  def print_vertically(self, df):
    '''
    Function: Print the first data point vertically
    Input: A dataframe
    Output: The first data point vertically
    '''
    return df.iloc[0]

  # Find duplicates
  def find_duplicates(self, df):
    '''
    Function - Find and display duplicate columns and rows of the dataframe
    Input - Dataframe
    Output - Display duplicate columns and rows of the dataframe
    '''
    begin = time.time()

    # Find duplicate rows
    row_duplicates = df[df.duplicated()]
    
    # Find duplicate columns as [duplicate, original] pairs
    column_duplicates = []
    
    for x in range(df.shape[1]):
      col = df.iloc[:, x]
      for y in range(x + 1, df.shape[1]):
        otherCol = df.iloc[:, y]
        
        # Store both names together so the pairs cannot get misaligned
        if col.equals(otherCol):
          column_duplicates.append([df.columns.values[y], df.columns.values[x]])

    print('\nRow duplicates:')
    print(row_duplicates)
    print('\n')
    print('\nColumn duplicates:')
    print(column_duplicates)

    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)


  # Check the data types of the dataframe
  def check_feature_types(self, df):
    '''
    Function: Get basic information on the dataframe 
    Input: Dataframe
    Output: Some basic information on the dataframe 
    '''
    begin = time.time()

    # Get the information from the dataframe

    # Get the shape of the dataframe
    data_shape = df.shape
    # Get the data types
    data_types = df.dtypes
    # Get the number of rows and columns
    n_rows = data_shape[0]
    n_columns = data_shape[1]
    # Find the missing values
    null_values = df.isnull().sum()
    # Find numeric columns and categorical columns
    numeric_columns = df.select_dtypes([np.number]).columns.tolist()
    columns = list(df)
    categorical_columns = []
    for i in columns:
      if i not in numeric_columns:
        categorical_columns.append(i)
      else:
        None
    
    print('\nCategorical columns:',len(categorical_columns))
    print(categorical_columns)
    print('\nNumeric columns:',len(numeric_columns))
    print(numeric_columns)
    print('\n')
    print('The column type information:')
    df.info()
    end = time.time()
    
    # Display the time taken to run this procedure
    end = time.time()
    self.find_time_taken(begin, end)


  # Function to calculate the time taken to run the function that calls it
  def find_time_taken(self, begin, end):
    '''
    Function: Calculate the time taken to run this procedure
    Input: Beginning time, ending time
    Action: Calculate the time taken for a procedure to be run in minutes or seconds
    '''
    time_taken = end - begin

    if time_taken >= 60:
      time_taken = round(time_taken/60,2)
      print('\n')
      print("Time taken to run this procedure:",time_taken,"minutes")
    else:
      time_taken = round(time_taken,2)
      print('\n')
      print("Time taken to run this procedure:",time_taken,"seconds")

  
  # Missing values analysis
  def missing_data_basic_analysis(self, df):
    '''
    Function: Provide analysis on the missing values in the dataframe
    Input: Dataframe
    Output: Row-wise and column-wise analysis of NaN values in the dataframe
    '''

    begin = time.time()

    data_shape = df.shape
    n_rows = data_shape[0]
    n_columns = data_shape[1]
    null_values = df.isnull().sum()
    columns_nans = df.isnull().sum()[df.isnull().sum() > 0]
    n_columns_nans = len(columns_nans)
    percent_nan_columns = round((columns_nans/n_rows)*100,2).sort_values()
    percent = percent_nan_columns
    drop_columns = percent.where(percent > 30)
    drop_columns = drop_columns.dropna()
    n_drop_columns = len(drop_columns)

    all_nan_rows = df[(df.T.isnull()).all()]

    nan_count_column = []
    for index, row in df.iterrows():
      nan_count = df.loc[[index]].isna().sum().sum()
      nan_count_column.append(round((nan_count/n_columns)*100,2))
    df['NaN %'] = nan_count_column
    df.sort_values(by=['NaN %'])
    
    print('Column Analysis:')
    print('\nColumns with null values:')
    print('There are',n_columns_nans,'columns with null values which is',round((n_columns_nans/n_columns)*100,0),'percent of the features')
    print('\nColumns with NaNs:')
    print(percent_nan_columns)
    print('\nColumns with NaNs greater than 30%:')
    print(drop_columns)
    print('\nNumber of columns to be dropped:')
    print(n_drop_columns,'needs to be dropped, that is,',round((n_drop_columns/n_columns)*100,0),'% of total features')
    print('\nDescriptive analysis of columns with missing values')
    print(percent_nan_columns.describe())

    print('\nRow Analysis:')
    print('\nThe number of rows with all NaN values are:',all_nan_rows.shape[0])
    df.drop('NaN %', inplace=True, axis=1)
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)

  
  def missing_data_basic_analysis_v2(self, df):
    '''
    Function: Provide analysis on the missing values in the dataframe if the dataframe is large 1M rows or above
    Input: Dataframe
    Output: Row-wise and column-wise analysis of NaN values in the dataframe
    '''
    begin = time.time()
    data_shape = df.shape
    n_rows = data_shape[0]
    n_columns = data_shape[1]
    null_values = df.isnull().sum()
    columns_nans = df.isnull().sum()[df.isnull().sum() > 0]
    n_columns_nans = len(columns_nans)
    percent_nan_columns = round((columns_nans/n_rows)*100,2).sort_values()
    percent = percent_nan_columns
    drop_columns = percent.where(percent > 30)
    drop_columns = drop_columns.dropna()
    n_drop_columns = len(drop_columns)

    all_nan_rows = df[(df.T.isnull()).all()]

    print('Column Analysis:')
    print('\nColumns with null values:')
    print('There are',n_columns_nans,'columns with null values which is',round((n_columns_nans/n_columns)*100,0),'percent of the features')
    print('\nColumns with NaNs:')
    print(percent_nan_columns)
    print('\nColumns with NaNs greater than 30%:')
    print(drop_columns)
    print('\nNumber of columns to be dropped:')
    print(n_drop_columns,'needs to be dropped, that is,',round((n_drop_columns/n_columns)*100,0),'% of total features')
    print('\nDescriptive analysis of columns with missing values')
    print(percent_nan_columns.describe())
    print('\nMissing Values in Rows')
    print(all_nan_rows)
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)

    

  # Analyze the zero values in the dataframe
  def zero_value_analysis(self, df):
    '''
    Function: Get row-wise and column-wise analysis of 0 values in the dataframe
    Input: Dataframe
    Output: Row-wise and column-wise analysis of 0 values in the dataframe
    '''

    begin = time.time()

    nulls = df.eq(0).sum() 
    column_1 = df.columns
    
    column_2 = nulls.tolist()
    
    dataframe_shape = df.shape
    dataframe_rows = dataframe_shape[0]
    dataframe_columns = dataframe_shape[1]
    column_3 = []
    for i in column_2:
      # Round after scaling to a percentage to avoid floating-point noise
      column_3.append(round((i/dataframe_rows)*100,2))

    # Get a table with the column name, number of zero values and percentage of zero values
    final_data = {'Column Name':column_1, 'Zero Values':column_2, '% of Zeroes':column_3}
    null_values_table = pd.DataFrame(final_data,columns = ['Column Name','Zero Values','% of Zeroes'])
    null_values_table.sort_values(by=['% of Zeroes'], inplace=True, ascending=False)
    #selection = df.loc[df['mylist']==0]
    null_values_table = null_values_table.loc[null_values_table['Zero Values'] > 0]

    print("Columns:")
    print('\n')
    print('Number of columns with 0 values:',null_values_table.shape[0])
    print(null_values_table)

    print('\n')
    print('Rows:')
    print('\n')
    
    zero_rows = df[(df.T == 0).all()]
    n_zero_rows = zero_rows.shape[0]
    print('The number of rows with only zero values is:',n_zero_rows)
    print('The percentage of rows with only zero values is:',round((n_zero_rows/dataframe_rows)*100,2))
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
  

  # Cardinality of features
  def display_cardinality(self, df):
    '''
    Function: Show cardinality of features in the dataframe
    Input: A dataframe
    Output: Column names and their cardinality
    '''
    begin = time.time()
    unique_values = df.nunique()
    print(unique_values)
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
  

  # Show all unique values in a single column
  def unique_values(self, df, column_name):
    '''
    Function: Show unique values of the column
    Input: The dataframe and column name in string format
    Output: Unique values of a column
    '''

    begin = time.time()
    unique_values = df[column_name].unique()
    print(unique_values)
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)


  # Show the categorical and numeric features
  def feature_types(self, df):
    '''
    Function: Find the number and names of categorical and numeric features
    Input: Dataframe
    Output: The number and names of categorical and numeric features
    '''
    begin = time.time()

    # Get the information from the dataframe
    numeric_columns = df.select_dtypes([np.number]).columns.tolist()
    columns = list(df)
    categorical_columns = []
    for i in columns:
      if i not in numeric_columns:
        categorical_columns.append(i)
      else:
        None
    
    # Print the information
    print('\nCategorical columns:',len(categorical_columns))
    print(categorical_columns)
    print('\nNumeric columns:',len(numeric_columns))
    print(numeric_columns)
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
  

  # Check spaces in column headers
  def check_spaces(self, df):
    '''
    Functions: Check if column headers have spaces and return names of those that do
    Input: Dataframe
    Output: Names of column headers with spaces
    '''
    begin = time.time()

    list_1 = df.columns
    list_2 = [c.replace(' ', '_') for c in list_1]
    col_names = set(list_1)-set(list_2)
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
    return list(col_names)


  # Numeric feature descriptive statistics
  def numeric_feature_details(self, df):
    '''
    Function: Display descriptive statistics of all numeric features in the dataframe
    Input: Dataframe
    Output: Descriptive statistics of all numeric features in the dataframe
    '''
    begin = time.time()

    # Get the information from the dataframe

    numeric_columns = df.select_dtypes([np.number]).columns.tolist()
    print('Number of numeric columns:', len(numeric_columns))
    print(numeric_columns)
    print('\n')

    for i in numeric_columns:
      print('Column Name:',i)
      print(df[i].describe())
      print('\n')
      print('median:',df[i].median())
      print('mode:',df[i].mode())
      print('\n')
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)


  def numeric_details_list(self, df, column_list):
    '''
    Function: Display descriptive statistics of the given numeric columns
    Input: Dataframe and a list of numeric column names
    Output: Descriptive statistics, median and mode of each listed column
    '''
    begin = time.time()

    # Get the information from the dataframe

    numeric_columns = column_list
    print('Number of numeric columns:', len(numeric_columns))
    print(numeric_columns)
    print('\n')

    for i in numeric_columns:
      print('Column Name:',i)
      print(df[i].describe())
      print('\n')
      print('median:',df[i].median())
      print('mode:',df[i].mode())
      print('\n')
    end = time.time()
    
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)

  
  # Certain columns can have a standardized length - check values of such columns
  def check_value_length(self, df, column_name):
    '''
    Function: To find the length of a value in a given column
    Input: The dataframe and name of the column as a variable
    Output: The length of a value in a given column
    '''
    begin = time.time()
    df_copy = df.copy(deep=False)
    df_copy['character_length'] = df_copy[column_name].astype(str).map(len)
    print(df_copy['character_length'].unique())
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
  


  def get_column_lists(self, df):
    '''
    Function: Get the categorical and numerical column names of a dataframe as two lists
    Input: A dataframe
    Output: The categorical and numerical column name lists
    '''
    begin = time.time()

    numeric_columns = df.select_dtypes([np.number]).columns.tolist()
    columns = list(df)
    categorical_columns = []
    for i in columns:
      if i not in numeric_columns:
        categorical_columns.append(i)
      else:
        None
    end = time.time()
    
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
    return numeric_columns, categorical_columns
  

  def convert_columns_string(self, df):
    '''
    Function: Convert all columns in the given dataframe to string type
    Input: A dataframe
    Output: The dataframe with all columns in string type
    '''
    begin = time.time()
    df_copy = df.copy(deep=False)
    for i in df_copy:
        df_copy[i] = df_copy[i].astype(str)
    end = time.time()
  
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
    return df_copy
  

  def check_column_values(self, df, column_list):
    '''
    Function: Check for data quality in each value of each categorical column
    Input: The dataframe and the list of categorical columns
    Output: Presence of spaces, numeric values, decimals, all alphabets, digits, 
            if the value is in lower, upper and title cases
    Dependency: Use convert_columns_string() to convert all columns to string format. 
                Or use get_column_lists() to get numeric and categorical column name lists
    '''
    begin = time.time()
    try:
            print('\n')
            print('Check if all characters in the string are whitespaces. True if only whitespace, else False.')
            for i in column_list:
              print('Column name:',i)
              check = df[i].str.isspace()
              print(check.unique())

            print('\n')
            print('Check whether all characters are numeric. True if only numeric, else False.')
            for i in column_list:
              print('Column name:',i)
              check = df[i].str.isnumeric()
              print(check.unique())

            print('\n')
            print('Check whether all characters are only alphabetic. True if there are only alphabets, else False.')
            for i in column_list:
              print('Column name:',i)
              check = df[i].str.isalpha()
              print(check.unique())

            print('\n')
            print('Check whether all characters are only digits. True if there are only digits, else False.')
            for i in column_list:
              print('Column name:',i)
              check = df[i].str.isdigit()
              print(check.unique())

            print('\n')
            print('Check whether all characters are only decimal. True if there are only decimal, else False.')
            for i in column_list:
              print('Column name:',i)
              check = df[i].str.isdecimal()
              print(check.unique())

            print('\n')
            print('Check whether all characters are only lower. True if there are only lower case, else False.')
            for i in column_list:
              print('Column name:',i)
              check = df[i].str.islower()
              print(check.unique())

            print('\n')
            print('Check whether all characters are only upper. True if there are only upper case, else False.')
            for i in column_list:
              print('Column name:',i)
              check = df[i].str.isupper()
              print(check.unique())
            
            print('\n')
            print('Check whether the value is in title case. True if title case, else False.')
            for i in column_list:
              print('Column name:',i)
              check = df[i].str.istitle()
              print(check.unique())
          
    # The .str accessor raises AttributeError on non-string columns
    except AttributeError:
      print('\nPlease check the above column type. Only columns of string type are allowed.')
      print('Check if the column is numeric or boolean and remove them from the column_list list.')
    
    end = time.time()

    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)


  def remove_elements_lists(self, main_list, remove_list):
    '''
    Function: Remove a set of elements from a list using another list
    Input: Two lists
    Output: The list where elements in the second lists are removed from the first list
    '''
    begin = time.time()
    final_list = [x for x in main_list if x not in remove_list]
    end = time.time()
      
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
    return final_list
  

  def replace_nans_zero(self, df, column_list):
    '''
    Function: Replace NaNs in the dataframe columns with 0
    Input: Dataframe, list of columns
    Output: Dataframe with values in specified column list 0
    '''
    begin = time.time()
    df_copy = df.copy(deep=False)
    for i in column_list:
      df_copy[i] = df_copy[i].replace(np.nan, 0)
    end = time.time()
      
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
    
    return df_copy
  

  def scatter_column_values(self, df, x_value, y_value):
    '''
    Function: Scatter plot between two columns in the same dataframe
    Input: Dataframe, x-value, y-value
    Output: Scatter plot
    Dependency: import matplotlib.pyplot as plt
    '''
    begin = time.time()
    # Set the size on the plot itself; a separate plt.figure() call would only open an empty figure
    df.plot(x=x_value, y=y_value, kind='scatter', figsize=(3, 4))
    print(x_value,'vs.',y_value)
    print('\n')
    plt.show()

    end = time.time()
      
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
  

  def print_outlier_values(self, df, column_name):
    '''
    Function: Print the upper and lower outlier thresholds of a column
              using the 1.5 * IQR rule
    Input: Dataframe and a column name
    Output: The upper and lower outlier thresholds (printed)
    '''
    begin = time.time()
    quartiles = df[column_name].quantile([0.25,0.5,0.75])
    q1 = quartiles[0.25]
    q3 = quartiles[0.75]
    iqr = q3 - q1
    lower_outlier_filter = q1 - 1.5 * iqr
    higher_outlier_filter = q3 + 1.5 * iqr
    print('Higher outlier threshold:',higher_outlier_filter)
    print('Lower outlier threshold:',lower_outlier_filter)
    end = time.time()
      
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)


  def get_outlier_dfs(self, df, column_name):
    '''
    Function: Find the higher and lower outlier values of a column.
              Get two dataframes with only higher and lower values rows
    Input: Dataframe and a column name
    Output: Two dataframes with higher and lower outliers respectively
    '''
    begin = time.time()
    quartiles = df[column_name].quantile([0.25,0.5,0.75])
    q1 = quartiles[0.25]
    q3 = quartiles[0.75]
    iqr = q3 - q1
    lower_outlier_filter = q1 - 1.5 * iqr
    higher_outlier_filter = q3 + 1.5 * iqr
    df_higher_outliers = df.loc[(df[column_name] > higher_outlier_filter)]
    df_lower_outliers = df.loc[(df[column_name] < lower_outlier_filter)]
    end = time.time()
      
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
    return df_higher_outliers, df_lower_outliers


  def get_outlier_normal_dist(self, df, column_name):
    '''
    Function: Get the outlier rows (absolute z-score above 3) of a column that is distributed normally
    Input: Dataframe, column name
    Output: Dataframe of outlier rows with the z-score added as a column
    Dependency: from scipy.stats import zscore
    '''
    begin = time.time()
    df_copy = df.copy(deep=False)
    new_col = column_name + '_zscore'
    df_copy[new_col] = zscore(df[column_name])
    df_higher_outliers = df_copy.loc[(abs(df_copy[new_col]) > 3)]
    end = time.time()
      
    # Display the time taken to run this procedure
    self.find_time_taken(begin, end)
    return df_higher_outliers

  def get_index_list(self, df, series):
    '''
    Function: Get indices of the dataframe where the series value is True 
    Input: The dataframe, a series
    Output: The rows of the dataframe at those indices
    '''
    index_list = series.loc[series == True].index.tolist()
    # The list holds index labels, so select with .loc rather than positional .iloc
    df_index = df.loc[index_list]
    return df_index
  


  def check_decimal_places(self, df, column_list):
    '''
    Function: Check the number of decimal places in a given column
    Input: The dataframe, column_list
    Output: The unique number of decimal places found in the columns
    '''
    begin = time.time()
    for i in column_list:
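      # Count the digits after the decimal point in each value's string form
      # (trailing zeros ignored); -floor(log10(x)) gives the order of magnitude instead
      decimals = df[i].astype(str).str.extract(r'\.(\d+)$')[0].str.rstrip('0')
      places = decimals.str.len().fillna(0).astype(int)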
      n_unique_places = places.unique()
      print('Column Name:',i)
      print(n_unique_places)
      print('\n')
    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)


  def barchart(self, df, column_name, chart_type):
    '''
    Function: Create bar chart
    Input: The dataframe and column name
    Output: Bar chart showing the frequency
    Dependency: import seaborn as sb
    '''

    begin = time.time()
    if chart_type == 'abs':
      sb.set_theme(style="whitegrid")
      sb.set_color_codes("pastel")
      #pal = ['black', 'grey', 'grey', 'grey']
      pal = "ch:.25"
      sb.set_style(style='white')
      title_value = column_name + ' absolute frequency'
      ax = sb.countplot(data = df, x = column_name, palette=pal).set(title=title_value)
      #order = titanic['class'].value_counts().index

    else:
      title_value = column_name + ' relative frequency'
      proportions_column = round(df[column_name].value_counts()/len(df),3)
      df_1 = pd.DataFrame({'Unique Values':proportions_column.index, 'Percentage':proportions_column.values})
      ax = df_1.plot.bar(x='Unique Values', y='Percentage', title = title_value, legend=False)
      ax.set_xlabel(column_name)
      ax.set_ylabel("Relative Frequency")
      # Hide grids
      ax.grid(False)
      # Hide axes ticks
      ax.set_xticks([])
      ax.set_yticks([])
      #ax.plot(legend=False)
      print('\n')
      # Show proportions
      print('Proportion Values:')
      print(df_1)
      print('\n')
    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)
  

  def get_column_lists(self, df):
    '''
    Function: Get the categorical and numerical column names of a dataframe as two lists
    Input: A dataframe
    Output: The categorical and numerical column name lists
    '''
    begin = time.time()

    numeric_columns = df.select_dtypes([np.number]).columns.tolist()
    # Every column that is not numeric is treated as categorical
    categorical_columns = [col for col in df.columns if col not in numeric_columns]
    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)
    return numeric_columns, categorical_columns
  

  def unique_values_categorical(self, df, column_list):
    '''
    Function: Find the number of unique values in the column lists
    Input: The dataframe and column list
    Output: Column name and the number of unique values
    '''
    begin = time.time()
    for i in column_list:
      print('Column Name:',i)
      unique = df[i].unique()
      print(len(unique))
    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)


  def bin_num_range_selector(self, df, column_name):
    '''
    Function: Find the minimum and maximum values of the column
    Input: Dataframe, column name
    Output: Maximum and minimum values in the dataframe
    '''
    begin = time.time()
    # Find the maximum value in the column
    maximum = df[column_name].max()
    # Find the minimum value in the column
    minimum = df[column_name].min()
    
    print('Maximum:',maximum)
    print('Minimum:',minimum)

    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)
  

  def bin_num_helper(self, upper, lower):
    '''
    Function: Get bin size suggestions as a list
    Input: Upper and lower values. 
            Upper - Slightly higher than the maximum value of the column (51 if max is 50). Must be a whole number
            Lower - Slightly lower than the minimum value of the column (1 if min is 2). Must be a whole number
    Output: A list of bin size suggestions
    '''
    begin = time.time()
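    # Avoid shadowing the built-in range()
    value_range = upper - lower
    # np.arange excludes the stop value, so use 21 to cover 1 to 20
    num_range = np.arange(1, 21).tolist()

    # Create a list of all numbers from 1 to 20 that divide the range evenly
    divisible_num = []
    for i in num_range:
      if value_range % i == 0:
        divisible_num.append(i)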
    print('Bin size suggestions:',divisible_num)
    print('\nOr any bin size from the range 1 to 20')
    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)


  def make_histogram(self, df, column_name, n_bin):
    '''
    Function: Display histogram
    Input: Dataframe, column name of a numeric column and number of bins
    Output: Histogram
    '''
    begin = time.time()
    fig = plt.figure(figsize=(15,4))
    ax = plt.gca()
    counts, _, patches = ax.hist(df[column_name], bins=n_bin)
    for count, patch in zip(counts,patches):
        ax.annotate(str(int(count)), xy=(patch.get_x(), patch.get_height()))
    plt.xlabel(column_name+' bins') 
    plt.ylabel('Frequency') 
    plt.title(column_name)
    plt.show()
    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)

  
  def find_substring_item(self, column_vals_list, substring_list):
    '''
    Function: Find the substring in the unique values of the column
    Input: Column values as a list
    Output: Elements of the list that contain the substring
    '''
    begin = time.time()
    for i in substring_list:
      print('\nItems with the substring:',i)
      #print('\n')
      for j in column_vals_list:
        if(i in j):
          print(j)

    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)


  # Check the frequency of the column
  def frequency_col(self, df, column_name):
    '''
    Function: Create a dataframe with frequency of a column in the given dataframe
    Input: The dataframe, column name
    Output: The dataframe with frequencies of the given column name
    '''
    begin = time.time()
    frequency = df[column_name].value_counts()
    end = time.time()
    
    # Display the time taken to run this procedure
      
    self.find_time_taken(begin, end)
    return frequency
  

  def frequency_table_groupby(self, df, column_name):
    '''
    Function: Group the table frequency with a column
    Input: The dataframe, column name
    Output: The dataframe with frequencies of all the columns grouped by a single column
    '''
    begin = time.time()
    frequency_df = df.groupby(column_name).count()
    end = time.time()
    self.find_time_taken(begin, end)
    return frequency_df

  def check_positive(self, df, column_name): 
    '''
    Function: Check if a column has only positive values or not
    Input: The dataframe and column name
    Output: Shows if the column has positive or negative values
    '''
    begin = time.time()

    check = df[column_name] > 0
    print(check.unique())
    
    end = time.time()
    self.find_time_taken(begin, end)

  def check_common_values(self, df_1, df_2, column_name_1, column_name_2):
    '''
    Function: Find the number of uncommon elements between two lists
    Input: Dataframe 1, Dataframe 2,column name of dataframe 1, column name of dataframe 2
    Output: the number of uncommon elements between two lists
    '''
    begin = time.time()
    list1 = df_1[column_name_1].unique()
    list2 = df_2[column_name_2].unique()
    list_difference_1 = [item for item in list1 if item not in list2]
    list_difference_2 = [item for item in list2 if item not in list1]
    print('The number of values in 1st dataframe but not in 2nd dataframe are,',len(list_difference_1))
    print('The number of values in 2nd dataframe but not in 1st dataframe are,',len(list_difference_2))
    end = time.time()
    self.find_time_taken(begin, end)


  def get_true_values(self, df, column_name, check_condition):
    '''
    Function: Check for a string condition in the values of the given column name 
    Input: Dataframe, column name, conditions - isdecimal, isspace, isnumeric, isalpha, isdigit, islower, isupper, istitle
    Output: List of index labels where the condition is true, i.e. values that are spaces, numeric,
            decimals, all alphabets, digits, or in lower, upper or title case
    '''
    begin = time.time()
    allowed_conditions = ['isdecimal', 'isspace', 'isnumeric', 'isalpha',
                          'isdigit', 'islower', 'isupper', 'istitle']
    check_list = None
    try:
      if check_condition not in allowed_conditions:
        raise ValueError('Unknown check condition: ' + str(check_condition))
      # Call the matching pandas string method, e.g. df[column_name].str.isspace()
      check = getattr(df[column_name].str, check_condition)()
      # Keep only the rows where the condition is True
      check = check[check]
      check_list = check.index.values.tolist()
    except Exception:
      print('\nPlease check the above column type. Only columns of string type are allowed.')
      print('Check if the column is numeric or boolean and remove them from the column_list list.')

    # Record the time after the try block so it is displayed on success as well
    end = time.time()
    self.find_time_taken(begin, end)
    return check_list
    

  def get_rows_list(self, df, column_name, value_list):
    '''
    Function: Select rows based on list values 
    Input: Dataframe, column name, list of values 
    Output: Rows of a dataframes based on list of values in a column
    '''
    begin = time.time()
    # isin() tests each value of the column against the list
    selected_rows = df[df[column_name].isin(value_list)]
    end = time.time()
    self.find_time_taken(begin, end)
    return selected_rows

  def show_column_details(self, df, column_name):
    '''
    Function: Show descriptive statistics required to divide the feature into classes
    Input: Dataframe, column name
    Output: Show descriptive statistics required to divide the feature into classes
    '''
    begin = time.time()
    print('Mean:',df[column_name].mean())
    print('Median:',df[column_name].median())
    print('Max:',df[column_name].max())
    print('Standard deviation:',df[column_name].std())
    print('\n')
    

    print('Min:',df[column_name].min())
    print('25th percentile:',df[column_name].quantile(0.25))
    print('50th percentile:',df[column_name].quantile(0.50))
    print('75th percentile:',df[column_name].quantile(0.75))
    print('Max:',df[column_name].max())
    end = time.time()
    self.find_time_taken(begin, end)







2 - Create an instance of the class

In [212]:
# Instance of DataAssessment class
instance = DataAssessment(df)

## 2. Data Cleaning

### 1. Solve quality issues

1. Round all float values in the dataframe to one decimal place

Code:

In [213]:
df = df.round(1)

Test:

In [214]:
df.head(n=2)

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0


2. Remove the row with item_id = 0 

Code:

In [215]:
item_zero_row = df[df['item_id'] == 0]

In [216]:
item_zero_row

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


Test:

In [217]:
df = df.loc[df['item_id'] != 0]
item_zero_row = df[df['item_id'] == 0]
item_zero_row

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


3. Remove 6 duplicate rows

Code:

In [218]:
row_duplicates = df[df.duplicated()]

In [219]:
row_duplicates

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


In [220]:
df = df.drop_duplicates()

Test:

In [221]:
row_duplicates = df[df.duplicated()]
row_duplicates

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


4 - Remove rows where the item_price is negative

Code:

In [222]:
negative_price_row = df[df['item_price'] < 0]

In [223]:
negative_price_row

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


In [224]:
df = df.loc[df['item_price'] > 0]

Test:

In [225]:
negative_price_row = df[df['item_price'] < 0]
negative_price_row

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


5 - Downcast dataframe to save memory

Code:

In [226]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2928485 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  int8   
 2   shop_id         int8   
 3   item_id         int16  
 4   item_price      float32
 5   item_cnt_day    float32
dtypes: float32(2), int16(1), int8(2), object(1)
memory usage: 78.2+ MB


In [227]:
float_columns = df.select_dtypes('float').columns
int_columns = df.select_dtypes('integer').columns

df[float_columns] = df[float_columns].apply(pd.to_numeric, downcast='float')
df[int_columns] = df[int_columns].apply(pd.to_numeric, downcast='integer')

Test:

In [228]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2928485 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  int8   
 2   shop_id         int8   
 3   item_id         int16  
 4   item_price      float32
 5   item_cnt_day    float32
dtypes: float32(2), int16(1), int8(2), object(1)
memory usage: 78.2+ MB
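
The two `df.info()` outputs above are identical, which suggests the columns already had their smallest dtypes when this cell ran. As a minimal sketch, here is what the same downcast does to a frame that still has the default 64-bit dtypes. The toy values are illustrative only, and the loop assigns one column at a time rather than using the block assignment from the cell above.

```python
import pandas as pd

# Toy frame with pandas' default 64-bit dtypes (values are illustrative only)
toy = pd.DataFrame({
    'shop_id': [59, 25, 25],
    'item_id': [22154, 2552, 2554],
    'item_price': [999.0, 899.0, 1709.0],
})
before = toy.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest dtype that holds its values
for col in toy.select_dtypes('float').columns:
    toy[col] = pd.to_numeric(toy[col], downcast='float')
for col in toy.select_dtypes('integer').columns:
    toy[col] = pd.to_numeric(toy[col], downcast='integer')

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print('Bytes before:', before, 'after:', after)
```

With these values, `shop_id` fits in int8, `item_id` needs int16 and `item_price` becomes float32.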


6 - Remove rows where *item_cnt_day* is negative

Code:

In [229]:
returns = df[df['item_cnt_day'] < 0]
returns

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


In [230]:
df = df.loc[df['item_cnt_day'] > 0]

Test:

In [231]:
returns = df[df['item_cnt_day'] < 0]
returns

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


7 - Combine sales_train with items on item_id to add item_category_id

Code:

In [232]:
# Load 'item_categories.csv' and 'items.csv'
df_item_categories = pd.read_csv('item_categories.csv')
df_items = pd.read_csv('items.csv')

In [233]:
# Merge df_items and df on item_id, then drop item_name
df_1 = pd.merge(df, df_items, on='item_id')
df_1 = df_1.drop(['item_name'], axis=1)


Test:

In [234]:
df_1.head(n=2)

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id
0,02.01.2013,0,59,22154,999.0,1.0,37
1,23.01.2013,0,24,22154,999.0,1.0,37
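
`pd.merge` defaults to an inner join, so a sales row whose item_id is missing from items.csv would be dropped without warning. The sketch below shows one way to check for that, using toy frames with hypothetical values. It assumes each item_id appears only once in the items table, which `validate='many_to_one'` enforces.

```python
import pandas as pd

# Toy frames (hypothetical values); item_id 99999 has no match in items
sales = pd.DataFrame({'item_id': [22154, 2552, 99999],
                      'item_cnt_day': [1.0, 1.0, 2.0]})
items = pd.DataFrame({'item_id': [22154, 2552],
                      'item_name': ['a', 'b'],
                      'item_category_id': [37, 58]})

# A left join keeps every sales row; indicator=True adds a '_merge' column
merged = pd.merge(sales, items.drop(columns='item_name'),
                  on='item_id', how='left',
                  validate='many_to_one', indicator=True)

# Rows that found no partner in items
unmatched = merged[merged['_merge'] == 'left_only']
print('Sales rows:', len(sales), 'Merged rows:', len(merged))
print('Unmatched rows:', len(unmatched))
```

If the unmatched count is zero on the real data, the inner join and the left join give the same rows.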


8 - item_price_class - Divide item prices into 4 classes

In [235]:
# Get Column Details
column_name = 'item_price'
instance.show_column_details(df, column_name)

Mean: 889.1832275390625
Median: 399.0
Max: 307980.0
Standard deviation: 1724.33642578125


Min: 0.10000000149011612
25th percentile: 249.0
50th percentile: 399.0
75th percentile: 999.0
Max: 307980.0


Time taken to run this procedure: 0.19 seconds


In [236]:
# Create the 'item_price_class' column based on 'item_price'

bin_edges = [-1, 249.0, 399.0, 999.0, 307980]
bin_names = [1,2,3,4]
df_1['item_price_class'] = pd.cut(df_1['item_price'], bin_edges, labels=bin_names)



In [237]:
df

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
3,06.01.2013,0,25,2554,1709.0,1.0
4,15.01.2013,0,25,2555,1099.0,1.0
5,10.01.2013,0,25,2564,349.0,1.0
...,...,...,...,...,...,...
2935844,10.10.2015,33,25,7409,299.0,1.0
2935845,09.10.2015,33,25,7460,299.0,1.0
2935846,14.10.2015,33,25,7459,349.0,1.0
2935847,22.10.2015,33,25,7440,299.0,1.0


In [238]:
result = df_1[df_1.isna().any(axis=1)]


In [239]:
result

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_price_class
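
To see how the bin edges map prices to classes, here is a small sketch with the same `bin_edges` and `bin_names` on a few hand-picked prices. `pd.cut` uses right-closed intervals by default, so a price equal to an edge falls into the lower class. A price above the last edge gets NaN, which is what the `isna()` check above would catch.

```python
import pandas as pd

bin_edges = [-1, 249.0, 399.0, 999.0, 307980]
bin_names = [1, 2, 3, 4]

# Hand-picked prices around the edges; 400000.0 lies outside the last bin
prices = pd.Series([0.1, 249.0, 250.0, 399.0, 999.0, 1709.0, 307980.0, 400000.0])
classes = pd.cut(prices, bin_edges, labels=bin_names)

print(pd.DataFrame({'item_price': prices, 'item_price_class': classes}))
```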


9 - Remove rows where item_cnt_day >= 6



Code:

In [240]:
rslt_df = df_1.loc[(df_1['item_cnt_day'] >= 6)]


In [241]:
rslt_df

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_price_class
1564,27.01.2014,12,12,2574,399.000000,17.0,55,2
5421,10.02.2013,1,31,2748,399.500000,8.0,19,3
7870,04.10.2013,9,9,2833,299.100006,7.0,30,2
7871,05.10.2013,9,9,2833,299.399994,10.0,30,2
8447,14.04.2013,3,26,2836,999.500000,7.0,23,4
...,...,...,...,...,...,...,...,...
2928175,04.10.2015,33,20,21939,1699.000000,8.0,61,4
2928186,03.10.2015,33,20,21941,1699.000000,10.0,61,4
2928189,03.10.2015,33,9,21941,1699.000000,6.0,61,4
2928190,04.10.2015,33,9,21941,1699.000000,7.0,61,4


In [242]:
df_1 = df_1.loc[df_1['item_cnt_day'] < 6]

Test:

In [243]:
rslt_df = df_1.loc[(df_1['item_cnt_day'] >= 6)]
rslt_df

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_price_class


10 - Split date

Code:

In [244]:
# Convert date to datetime
df_1['date'] = pd.to_datetime(df_1['date'])

In [245]:
# Split date
df_1['day'] = df_1['date'].dt.day
df_1['month'] = df_1['date'].dt.month
df_1['year'] = df_1['date'].dt.year
df_1["day_of_week"] = df_1['date'].dt.dayofweek
df_1["is_weekend"] = df_1['date'].dt.dayofweek > 4

Test:

In [246]:
df_1.head(n=1)

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_price_class,day,month,year,day_of_week,is_weekend
0,2013-02-01,0,59,22154,999.0,1.0,37,3,1,2,2013,4,False
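
In the output above, the first row has date_block_num 0 (January 2013) and a source date of 02.01.2013, yet it parses as 2013-02-01. This suggests the day and month were swapped on rows where the day is 12 or less. The sketch below assumes all dates use the dd.mm.yyyy layout seen in the earlier outputs and passes that format explicitly.

```python
import pandas as pd

# Sample dates in the dd.mm.yyyy layout used by the raw data
raw = pd.Series(['02.01.2013', '23.01.2013', '10.10.2015'])

# An explicit format removes the day/month ambiguity
parsed = pd.to_datetime(raw, format='%d.%m.%Y')

out = pd.DataFrame({
    'date': parsed,
    'day': parsed.dt.day,
    'month': parsed.dt.month,
    'year': parsed.dt.year,
    'day_of_week': parsed.dt.dayofweek,    # Monday = 0, Sunday = 6
    'is_weekend': parsed.dt.dayofweek > 4,
})
print(out)
```

With the explicit format, 02.01.2013 becomes 2 January 2013, so `month` agrees with `date_block_num`.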


11 - Remove rows where item_price >= 2124


Code:

In [247]:
rslt_df = df_1.loc[(df_1['item_price'] >= 2124)]

In [248]:
rslt_df

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_price_class,day,month,year,day_of_week,is_weekend
4119,2013-02-01,0,25,2719,2699.0,1.0,19,4,1,2,2013,4,False
4120,2013-12-01,0,25,2719,2699.0,1.0,19,4,1,12,2013,6,True
4123,2013-05-01,0,29,2719,2699.0,1.0,19,4,1,5,2013,2,False
4124,2013-07-01,0,4,2719,2699.0,1.0,19,4,1,7,2013,0,False
4125,2013-01-20,0,4,2719,2699.0,1.0,19,4,20,1,2013,6,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2928475,2015-02-10,33,25,13476,7998.0,1.0,11,4,10,2,2015,1,False
2928478,2015-10-29,33,25,16094,2449.0,1.0,64,4,29,10,2015,3,False
2928481,2015-01-10,33,25,7903,12198.0,1.0,15,4,10,1,2015,5,True
2928482,2015-10-29,33,25,7610,2890.0,1.0,64,4,29,10,2015,3,False


In [249]:
df_1 = df_1.loc[df_1['item_price'] < 2124]

Test:

In [250]:
rslt_df = df_1.loc[(df_1['item_price'] >= 2124)]
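
The 2124 cutoff matches the upper fence of the 1.5 x IQR rule used in `get_outlier_dfs`, applied to the item_price quartiles printed in step 8 (25th percentile 249.0, 75th percentile 999.0). Whether the threshold was actually chosen this way is an assumption. The arithmetic is:

```python
# Quartiles of item_price as printed by show_column_details in step 8
q1, q3 = 249.0, 999.0

iqr = q3 - q1                   # 750.0
upper_fence = q3 + 1.5 * iqr    # 999.0 + 1125.0
lower_fence = q1 - 1.5 * iqr    # 249.0 - 1125.0

print('Upper fence:', upper_fence)
print('Lower fence:', lower_fence)
```

The lower fence is negative, so no price can fall below it and only the upper fence removes rows.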

12 - Save dataframe

Code:

In [251]:
df_1.to_csv('cleaned_1.csv')


In [252]:
# Reload the cleaned dataframe from 'cleaned_1.csv'
df_1 = pd.read_csv('cleaned_1.csv')

Test:

In [253]:
df_1.head()

Unnamed: 0.1,Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_price_class,day,month,year,day_of_week,is_weekend
0,0,2013-02-01,0,59,22154,999.0,1.0,37,3,1,2,2013,4,False
1,1,2013-01-23,0,24,22154,999.0,1.0,37,3,23,1,2013,2,False
2,2,2013-01-20,0,27,22154,999.0,1.0,37,3,20,1,2013,6,True
3,3,2013-02-01,0,25,22154,999.0,1.0,37,3,1,2,2013,4,False
4,4,2013-03-01,0,25,22154,999.0,1.0,37,3,1,3,2013,4,False


In [254]:
result = df_1[df_1.isna().any(axis=1)]


In [255]:
result

Unnamed: 0.1,Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_price_class,day,month,year,day_of_week,is_weekend
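
The extra `Unnamed: 0` column in the reloaded frame is most likely the index that `to_csv` writes as an unlabeled first column. A minimal sketch of a round trip that avoids it is shown below. It uses an in-memory buffer instead of a file and toy values. `index=False` and `parse_dates` are standard pandas parameters, but their use here is a suggestion, not part of the original notebook.

```python
import io
import pandas as pd

# Toy frame standing in for df_1 (values are illustrative only)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-02', '2013-01-23']),
    'item_price': [999.0, 999.0],
})

# Write without the index so no unlabeled column is created
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read back, asking pandas to restore the datetime dtype of 'date'
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=['date'])

print(reloaded.columns.tolist())
print(reloaded.dtypes)
```

Without `parse_dates`, the date column comes back as plain strings, so the datetime conversion from step 10 would have to be repeated.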
