# Dior, Hermes, Kering, and Louis Vuitton Stock Price (2000 - 2022) Analysis: Data Preparation

Author: Zhongyi (James) Guo <br>
Date: 02/26/2024 <br>
Email: guozy@stanford.edu | jamesguo0320@icloud.com <br>
Personal Website: https://jg1andonly.github.io

In [1]:
import pandas as pd

### Data Cleaning

Data source: The dataset titled `French Luxury Companies 2000-2022.csv` was downloaded from the following URL: https://www.kaggle.com/datasets/prasertk/french-luxury-companies. We have made it available in our repository for easier access.

First, we will load the `French Luxury Companies 2000-2022.csv` dataset into a pandas DataFrame for our analysis.

In [2]:
df = pd.read_csv('French luxury companies 2000-2022.csv')
df.head()

,Date,Symbol,Adj Close,Close,High,Low,Open,Volume
0,1999-12-31,CDI.PA,32.674278,61.5,61.5,61.5,61.5,0.0
1,2000-01-03,RMS.PA,37.21077,49.033333,50.0,48.333332,49.666664,16395.0
2,2000-01-03,MC.PA,57.833614,88.800003,93.18,88.199997,91.779999,615855.0
3,2000-01-03,CDI.PA,32.844299,61.82,63.950001,61.75,62.200001,424416.0
4,2000-01-03,KER.PA,120.705658,239.14595,249.091003,235.149338,246.116776,185169.0


At first glance, the first row, dated '1999-12-31', falls outside the period covered by our study (2000 - 2022) and pertains only to Christian Dior (`CDI.PA`). We will treat this entry as noise and remove it from the dataframe.

Additionally, every column name is capitalized, and `Adj Close` contains a space. To maintain consistency and make columns easier to reference, we will convert all column names to lowercase and replace spaces with underscores.

In [3]:
# Drop the first row (dated 1999-12-31) and renumber the index from 0.
df = df.iloc[1:].reset_index(drop=True)
# Lowercase the column names and replace spaces with underscores.
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
df.head()

,date,symbol,adj_close,close,high,low,open,volume
0,2000-01-03,RMS.PA,37.21077,49.033333,50.0,48.333332,49.666664,16395.0
1,2000-01-03,MC.PA,57.833614,88.800003,93.18,88.199997,91.779999,615855.0
2,2000-01-03,CDI.PA,32.844299,61.82,63.950001,61.75,62.200001,424416.0
3,2000-01-03,KER.PA,120.705658,239.14595,249.091003,235.149338,246.116776,185169.0
4,2000-01-04,RMS.PA,34.934105,46.033333,48.766666,45.533333,48.766666,33063.0
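
Dropping the first row by position works here because the out-of-range entry happens to come first. As a minimal sketch, assuming the dates are ISO-formatted strings as in the preview above, a date-based filter gives the same result without depending on row order. The `sample` frame is a small stand-in built from values in the preview, and `in_range` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the raw data; values copied from the preview above.
sample = pd.DataFrame({
    'Date': ['1999-12-31', '2000-01-03', '2000-01-03'],
    'Symbol': ['CDI.PA', 'RMS.PA', 'MC.PA'],
    'Adj Close': [32.674278, 37.21077, 57.833614],
})

# Keep only rows dated 2000-01-01 or later. ISO-formatted date strings
# (YYYY-MM-DD) sort lexicographically, so a string comparison is enough here.
in_range = sample[sample['Date'] >= '2000-01-01'].reset_index(drop=True)

# Normalize column names: lowercase, spaces replaced by underscores.
in_range.columns = [col.lower().replace(' ', '_') for col in in_range.columns]
print(in_range)
```

The filter would also catch any other out-of-range rows, should they appear further down the file.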


Subsequently, we'll evaluate the presence of null values within this dataframe and examine the data type of each column.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22940 entries, 0 to 22939
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       22940 non-null  object 
 1   symbol     22940 non-null  object 
 2   adj_close  22940 non-null  float64
 3   close      22940 non-null  float64
 4   high       22940 non-null  float64
 5   low        22940 non-null  float64
 6   open       22940 non-null  float64
 7   volume     22940 non-null  float64
dtypes: float64(6), object(2)
memory usage: 1.4+ MB
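
`df.info()` reports non-null counts per column. An explicit count of missing values, along with a check for repeated (date, symbol) pairs, can make the same inspection easier to read. The sketch below runs on a small made-up frame in which one missing value is injected on purpose; `missing_per_column` and `duplicate_rows` are illustrative names.

```python
import pandas as pd

# Toy stand-in frame; the None value is deliberately injected for illustration.
sample = pd.DataFrame({
    'date': ['2000-01-03', '2000-01-03', '2000-01-04'],
    'symbol': ['RMS.PA', 'MC.PA', 'RMS.PA'],
    'close': [49.033333, None, 46.033333],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that repeat an earlier (date, symbol) pair.
duplicate_rows = sample.duplicated(subset=['date', 'symbol']).sum()

print(missing_per_column)
print(duplicate_rows)
```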


The output shows 22,940 non-null entries in every column, so the dataframe has no missing values. The `date` column is stored as `object`, however, and is better suited to a datetime type, while the `symbol` column should be a string type. We will convert both columns for easier handling and analysis.

All numerical columns are stored as `float64`, which matches our expectations.

In [5]:
df['date'] = pd.to_datetime(df['date'])
df['symbol'] = df['symbol'].astype(str)
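
After a conversion like this, it can help to confirm the resulting dtypes. Note that `astype(str)` converts values to Python strings, but depending on the pandas version the column may still be reported as `object`. The sketch below, on a small stand-in frame, shows one possible alternative: pandas' dedicated `string` dtype and an explicit date format. These choices are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Toy stand-in frame with the same two columns being converted.
sample = pd.DataFrame({
    'date': ['2000-01-03', '2000-01-04'],
    'symbol': ['RMS.PA', 'MC.PA'],
})

# Parse ISO dates; an explicit format makes malformed entries raise an error.
sample['date'] = pd.to_datetime(sample['date'], format='%Y-%m-%d')

# Use pandas' dedicated string dtype instead of generic Python objects.
sample['symbol'] = sample['symbol'].astype('string')

print(sample.dtypes)
```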

Additionally, based on the data documentation, the symbols `RMS.PA`, `MC.PA`, `CDI.PA`, and `KER.PA` represent the companies Hermes, Louis Vuitton (LV; the ticker of its parent group, LVMH Moët Hennessy Louis Vuitton), Dior, and Kering, respectively. We will replace these symbols with the corresponding company names, all in lowercase for consistency. Since the column will then hold company names rather than ticker symbols, we will also rename it from `symbol` to `company`.

In [6]:
symbol_to_company = {
    'rms.pa': 'hermes',
    'mc.pa': 'lv',
    'cdi.pa': 'dior',
    'ker.pa': 'kering'
}
df['symbol'] = df['symbol'].str.lower().map(symbol_to_company)
df = df.rename(columns={'symbol': 'company'})
df.head()

,date,company,adj_close,close,high,low,open,volume
0,2000-01-03,hermes,37.21077,49.033333,50.0,48.333332,49.666664,16395.0
1,2000-01-03,lv,57.833614,88.800003,93.18,88.199997,91.779999,615855.0
2,2000-01-03,dior,32.844299,61.82,63.950001,61.75,62.200001,424416.0
3,2000-01-03,kering,120.705658,239.14595,249.091003,235.149338,246.116776,185169.0
4,2000-01-04,hermes,34.934105,46.033333,48.766666,45.533333,48.766666,33063.0
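
One caveat with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN` without raising an error. A minimal sketch of a check for such unmapped symbols follows. The ticker `XYZ.PA` is made up for illustration and does not appear in the dataset, and `unmapped` is an illustrative name.

```python
import pandas as pd

symbol_to_company = {
    'rms.pa': 'hermes',
    'mc.pa': 'lv',
    'cdi.pa': 'dior',
    'ker.pa': 'kering',
}

# Toy symbols; 'XYZ.PA' is a made-up ticker that is not in the mapping.
symbols = pd.Series(['RMS.PA', 'MC.PA', 'XYZ.PA'])

# Same transformation as in the notebook: lowercase, then dictionary lookup.
mapped = symbols.str.lower().map(symbol_to_company)

# Original symbols whose lookup came back as NaN.
unmapped = sorted(set(symbols[mapped.isna()]))
print(unmapped)
```

An empty list would confirm that every symbol in the data has a company name.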


Next, we rearrange the price columns to follow the conventional order for a trading day: `open`, `high`, `low`, `close`, and `adj_close`. The remaining columns are left unchanged.

In [7]:
columns_order = ['date', 'company', 'open', 'high', 'low', 'close', 'adj_close', 'volume']
df = df[columns_order]
df.head()

,date,company,open,high,low,close,adj_close,volume
0,2000-01-03,hermes,49.666664,50.0,48.333332,49.033333,37.21077,16395.0
1,2000-01-03,lv,91.779999,93.18,88.199997,88.800003,57.833614,615855.0
2,2000-01-03,dior,62.200001,63.950001,61.75,61.82,32.844299,424416.0
3,2000-01-03,kering,246.116776,249.091003,235.149338,239.14595,120.705658,185169.0
4,2000-01-04,hermes,48.766666,48.766666,45.533333,46.033333,34.934105,33063.0


The general dataframe has been cleaned and prepared. We will now proceed to export it as a CSV file for subsequent analysis and use.

In [8]:
df.to_csv('cleaned_general_data.csv', index=False)

Furthermore, we opted to create separate datasets for each individual company.

In [9]:
grouped_dfs = {company: group for company, group in df.groupby('company')}
df_dior = grouped_dfs['dior']
df_hermes = grouped_dfs['hermes']
df_lv = grouped_dfs['lv']
df_kering = grouped_dfs['kering']

In [10]:
df_dior.to_csv('dior.csv', index=False)
df_hermes.to_csv('hermes.csv', index=False)
df_lv.to_csv('lv.csv', index=False)
df_kering.to_csv('kering.csv', index=False)
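
CSV files do not store dtypes, so the datetime conversion has to be repeated when an exported file is loaded again. The sketch below illustrates this round trip with `parse_dates`. It writes to an in-memory buffer instead of a file so that it is self-contained, and the `sample` frame is a stand-in built from the Hermes rows in the preview above.

```python
import io

import pandas as pd

# Toy stand-in for one company's cleaned data (values from the preview above).
sample = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-03', '2000-01-04']),
    'company': ['hermes', 'hermes'],
    'close': [49.033333, 46.033333],
})

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that plain CSV text cannot carry.
restored = pd.read_csv(buffer, parse_dates=['date'])
print(restored.dtypes)
```

In later notebooks, the same `parse_dates=['date']` argument could be passed when reading `dior.csv`, `hermes.csv`, `lv.csv`, or `kering.csv`.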

In conclusion, the dataset is now clean, organized, and ready for analysis. We checked for missing values, converted columns to appropriate data types, standardized the column names, replaced ticker symbols with company names, and split the data into one file per company. This prepared data will support the next stages of our analysis, where we explore trends and patterns in the stock performance of these luxury companies.