# Basic Exploratory Analysis Demo

Import libraries useful for data exploration, including the dataviz library of your choice:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

First, let's load the data (created by Josh Murrey, [downloadable here](https://www.kaggle.com/datasets/thedevastator/books-sales-and-ratings)):

In [3]:
df = pd.read_csv('data/Books_Data_Clean.csv')
df.head()

| | index | Publishing Year | Book Name | Author | language_code | Author_Rating | Book_average_rating | Book_ratings_count | genre | gross sales | publisher revenue | sale price | sales rank | Publisher | units sold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1975.0 | Beowulf | Unknown, Seamus Heaney | en-US | Novice | 3.42 | 155903 | genre fiction | 34160.0 | 20496.0 | 4.88 | 1 | HarperCollins Publishers | 7000 |
| 1 | 1 | 1987.0 | Batman: Year One | Frank Miller, David Mazzucchelli, Richmond Lew... | eng | Intermediate | 4.23 | 145267 | genre fiction | 12437.5 | 7462.5 | 1.99 | 2 | HarperCollins Publishers | 6250 |
| 2 | 2 | 2015.0 | Go Set a Watchman | Harper Lee | eng | Novice | 3.31 | 138669 | genre fiction | 47795.0 | 28677.0 | 8.69 | 3 | Amazon Digital Services, Inc. | 5500 |
| 3 | 3 | 2008.0 | When You Are Engulfed in Flames | David Sedaris | en-US | Intermediate | 4.04 | 150898 | fiction | 41250.0 | 24750.0 | 7.5 | 3 | Hachette Book Group | 5500 |
| 4 | 4 | 2011.0 | Daughter of Smoke & Bone | Laini Taylor | eng | Intermediate | 4.04 | 198283 | genre fiction | 37952.5 | 22771.5 | 7.99 | 4 | Penguin Group (USA) LLC | 4750 |


Performing an exploratory data analysis is first and foremost about understanding the data. The best way to do this is to answer questions:
- what are the data? What do they represent? How were they obtained?
- what is their type?
- are there any missing data? Outliers?

The answers to these questions will help us assess the quality of the data, format them and prepare "cleaning" procedures.
To go further, additional information may be useful:
- how are they distributed?
- can we observe notable patterns?
- can we quantify outliers?

Then, once we know and understand the data better (after an initial cleaning), we can ask more precise questions about the relationships between variables ("explanations" or "effects" from a statistical point of view).

Of course, all these steps and questions are interconnected. An EDA can quickly become "labyrinthine", with many avenues to explore in parallel. Hence the value of notebooks, and of writing them well (clarity, formatting, organization). It is an iterative process.

Do not forget that, at the start, someone (a client, a department of our company, a colleague) asks a question. Here it could be: "what predicts the number of sales of a book?", or "which books generate the highest turnover?" or "what are the characteristics that make a book receive good reviews?". It is then our knowledge of the field, our experience and our intuition that will guide us. Being a data scientist is not just about knowing libraries like Pandas or Seaborn; otherwise a simple tool like ydata-profiling would be enough to make a very good data analyst!

So let's start from the beginning: let's see what the columns are, their types and whether there are null values, using the method that you should now know very well...

In [1]:
df.info()

First, let's try to understand what quantity each column represents. Let's take the time to read the labels, see what they relate to, and identify the orders of magnitude of the numerical values and the kinds of categories in the first rows of the dataframe.

The publication years are stored as `float64`, which is strange. Should we convert this type to `int`?
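
As a hint for this question, here is a minimal sketch on a toy Series (the values are invented for illustration, not read from the dataset). It assumes the usual pandas behaviour: a numeric column that contains a missing value is stored as `float64`, and the nullable `Int64` extension dtype is one possible way to keep integers anyway.

```python
import numpy as np
import pandas as pd

# Toy years, invented for illustration; one value is missing
years = pd.Series([1975.0, 1987.0, np.nan, -560.0])

# NumPy integers cannot represent NaN, so pandas stores the column as float64
print(years.dtype)

# The nullable "Int64" extension dtype keeps integers and marks the gap as <NA>
years_int = years.astype('Int64')
print(years_int)
```

Whether the conversion is worth it depends on what we do next: a plain `int` conversion is only possible once the missing year has been handled.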

Another thing: there is an index column that is totally useless (it duplicates the dataframe index), and the column names are not formatted uniformly (sometimes with spaces, capital letters, underscores, etc.). To make the columns easier to manipulate, let's standardize all the names: replace the spaces with underscores and lowercase everything. (In the doc, look for the `.rename()` method.)

In [3]:
# apply the same transformation to every column name in a single call
df = df.rename(columns=lambda c: c.replace(' ', '_').lower())
df.columns

What do we see for a particular column? Do you detect a problem?

« your answer here (edit the md cell)»

Fix this:

In [6]:
df.rename(columns={'publisher_': 'publisher'}, inplace=True)

Remove the "index" column:

In [2]:
df.drop('index', inplace=True, axis=1)

Are there duplicates?

In [3]:
df.duplicated().sum()

Another way to detect duplicates: create a dataframe where duplicates have been removed, and check whether the two dataframes have the same number of rows:

In [4]:
df.drop_duplicates().shape

If we get the same dimensions as the original dataframe, we can conclude that there are no duplicates.
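
To see that the two checks agree, here is a small sketch on a toy dataframe (the titles and years are invented for illustration): the number of rows flagged by `.duplicated()` should equal the number of rows removed by `.drop_duplicates()`.

```python
import pandas as pd

# Toy data, invented for illustration: the last row repeats the first one
toy = pd.DataFrame({'title': ['A', 'B', 'A'], 'year': [2001, 2005, 2001]})

# rows that repeat an earlier row
n_flagged = toy.duplicated().sum()

# rows dropped when duplicates are removed
n_removed = len(toy) - len(toy.drop_duplicates())

print(n_flagged, n_removed)
```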

What about missing values? (is there a method to identify `NaN` values?)

In [5]:
df.isna().sum()

There are a few missing values. But before deciding how to handle them (delete? replace? by what method?), let's explore the values of these variables more closely.
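
When deciding how to handle missing values, the share of missing rows is often as informative as the raw count. A minimal sketch on toy data (the column names mirror the dataset, the values are invented):

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'book_name': ['A', None, 'C', 'D'],
    'language_code': ['eng', 'eng', None, None],
    'units_sold': [10, 20, 30, 40],
})

# .isna().sum() counts missing values, .isna().mean() gives their proportion
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'share': toy.isna().mean(),
})
print(missing)
```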

### Quantitative (numerical) variables

Which method lets us quickly explore numeric values? (It is not `.info()`, but the other one!)

In [6]:
df.describe()

This method gives us a first idea of the distribution of the numerical variables, and their order of magnitude. Let's look carefully at each column. First to "understand" the distribution of values in the different columns, but also to identify what we would perceive as anomalies. 

For example, we see that the average rating of the books is quite high, around 4 on a scale that apparently goes up to 5. We also observe that the number of ratings per book is large: on the order of several tens of thousands, up to more than 200,000 and never less than 27,000, with a mean a little below 100,000. We also notice that the dispersion in the number of books sold is huge: a standard deviation of more than 15,000 for a mean of nearly 10,000, and a median far from the mean. And so on.

The minimum value for the year of publication is negative! Does this seem normal to you, or abnormal?


« Your response here » (edit the md cell)


We will have to look more closely at the distribution of values; we will see this later. First, let's look at the row whose publication year is equal to -560 (think of boolean indexing):

In [7]:
df[df.publishing_year == -560]

Which author is this (feel free to google…)? Does the value for the year of publication make sense then?

« your response here » (edit the md cell)

Get all books for which the publication year is negative:

In [8]:
df[df.publishing_year < 0]

We have a few, mainly from the Greco-Roman area, but also Chinese (Lao Tzu). We also note that we have a problem encoding non-Latin characters, Greek and Chinese, it seems.

### Qualitative (categorical) variables

Let's see how many categories (= unique values) there are in each of these columns:

In [9]:
for c in df.columns:
    if not pd.api.types.is_numeric_dtype(df[c]):
        print(c)
        print(df[c].nunique())

This data concerns 735 different authors.

Let's look at the distribution of the categories for the last 4 columns, which "actually" designate categories, that is, columns with a fairly small number of distinct values given the number of records. (Book titles and authors necessarily have many unique values; are they really categories?)

Which 4 columns do you consider to be "actual" categories?

`'language_code'`, `'author_rating'`, `'genre'`, `'publisher'`

Let’s have a look at the distributions of these categories (for each column retained, how many different/unique values?)

In [1]:
for c in ['language_code', 'author_rating', 'genre', 'publisher']:
    print(df[c].value_counts(), '\n')
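
Raw counts can be hard to compare between columns. As a hedged complement, `value_counts(normalize=True)` returns proportions instead, which makes imbalances between categories easier to read. A sketch on toy data (the values are invented):

```python
import pandas as pd

# Toy genre column, invented for illustration
genre = pd.Series(['fiction', 'fiction', 'fiction', 'nonfiction'], name='genre')

# normalize=True divides each count by the number of non-missing rows
shares = genre.value_counts(normalize=True)
print(shares)
```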

What problems were found so far?

Missing data:
* books without titles (23)
* language codes (53)
* year of publication

Outliers:
* at least one negative value for the year of publication

Duplicates:
* clearly no duplicates

Categorical variables:
* for many categorical variables: language, publisher, genre and even author ratings, there are strong imbalances between categories in terms of counts

Show null values for the publication year:

In [None]:
df[df['publishing_year'].isna()] 

Show null values for titles:

In [3]:
df[df['book_name'].isna()] 

Show a *preview* or an extract (there are many values) of works whose language code is not specified:

In [4]:
df[df['language_code'].isna()].head() 

Apart from the single missing publication year, missing values only concern categorical variables here.
We note that books in languages other than English are very much in the minority. The `language_code` may not have much weight in the analyses.
Furthermore, since each book has a different title, we are unlikely to observe an effect of the title on the other variables, unless we study regularities in the titles very precisely (format, lexicon, structure, length, etc.), which we are not going to do for now.
Finally, the unknown year of publication concerns a single book (whose title and language are also unknown). The year can be an important variable, but since only one book is affected, this missing value will have little effect.

We may be tempted to delete the rows with missing values. This would only slightly change the effects of the variables concerned, but it would also delete the values of the other variables in those rows, which are complete and may be relevant for future analyses. The situation is slightly different for the missing year.

On the other hand, analyzing books that we are not able to identify because we do not know the title is a bit strange. We can doubt the reliability of the information concerning them.

Concerning the language codes: since the large majority of books are in English, we can assume that the books without a code would have been coded as English anyway. Or were they left uncoded because they did not correspond to any listed language? Let's see whether the authors' names can help us decide (display the authors of the books without a language code):

In [None]:
df[df['language_code'].isna()].author



We note a majority of English-language authors, with no mention of a translator, and the titles are in English too. We can therefore assume that the first hypothesis is the correct one.

For example, we could decide to:

- drop books whose titles are missing, which will also delete the row where the year is missing

- use a replacement method implemented in pandas that replaces a missing value of a variable with the value of that variable in a neighbouring row (the previous or the next one in the dataframe), which amounts to duplicating that value. Provided the row order is unrelated to the variable, this is practically equivalent to filling in the missing values at random, with a good chance of respecting the frequency of the values in the sample, without affecting the average too much (see CLT).
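
The second option can be illustrated on a toy Series (the codes are invented for illustration). Each missing value takes the last valid value that precedes it; note that a missing value at the very start would stay missing, since nothing precedes it.

```python
import pandas as pd

# Toy language codes, invented for illustration
codes = pd.Series(['eng', None, 'en-US', None, None, 'fre'])

# forward fill: propagate the last valid value downwards
filled = codes.ffill()
print(filled.tolist())
```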

Remove lines with null values **for titles** (in the doc look for the `.dropna()` method):

In [6]:
df.dropna(subset=['book_name'], inplace=True)  # a list works in every pandas version
df.info()

**For records without a language code**, find a replacement method. Look for the `.ffill()` method in the docs.

In [7]:
# assign the result back: an in-place fill on a selected column may not modify df
df['language_code'] = df['language_code'].ffill()
df.info()

Now that we have a prepared dataset that fits our needs, we can save it:

In [None]:
# index=False avoids writing the dataframe index as an extra unnamed column
df.to_csv('data/Books_Data_Prepared_sav.csv', index=False)

## Distributions

Let's start with the first column, `publishing_year`. Some of its values already intrigued us, so it can be enlightening to look at the distribution more closely and see whether it justifies a first filtering of the data. Let's use Seaborn, which is a good compromise between ease of use and readability. Do not forget the bare minimum for a figure: a title, axis labels, etc.

In [1]:
sns.histplot(df['publishing_year'], kde = True)\
            .set(title='Distribution of publication years');

We can see that the majority of publications took place around the 2000s, with an exponential progression. Let's see what happens from the 19th century onwards (here again, boolean indexing is our friend):

In [2]:
sns.histplot(df[df['publishing_year']>1800]['publishing_year'], kde = True);

There has been an exponential increase in the number of publications since the 1970s, after a sharp increase since the post-war period.

Other distributions may pique our curiosity:

* number of books in each genre:

In [3]:
sns.histplot(df['genre'])

* Distribution of average book ratings. As a variable that we may try to explain, or which may serve as an explanatory variable, it is important to have a clue about the way it is distributed. Let’s see what it looks like:

In [4]:
sns.histplot(df['book_average_rating'], kde = True);

* It’s not just books that receive a rating: authors do too. Authors are classified into four categories, from novice to famous. Let’s see the distribution in those 4 categories (the calculation has already been done previously, but a data viz is clearer):

In [5]:
sns.histplot(df['author_rating']); # note: a KDE makes no sense for a categorical variable

Even though these are categorical variables, there is a notion of order here: novice < intermediate < excellent < famous.

The two central categories account for 92% of the ratings (see figures above). It is therefore worth drawing the plot with the categories in the correct order (look in the doc for the argument `order=`):

In [6]:
sns.countplot(x=df['author_rating'], order=['Novice', 'Intermediate', 'Excellent', 'Famous']);
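
Instead of repeating the `order=` list in every plot, one option (an alternative, not what this notebook does) is to store the order in the column itself with an ordered categorical dtype. A minimal sketch on toy data, with invented values:

```python
import pandas as pd

# The four levels, from lowest to highest
levels = ['Novice', 'Intermediate', 'Excellent', 'Famous']
rating_dtype = pd.CategoricalDtype(categories=levels, ordered=True)

# Toy ratings, invented for illustration
toy = pd.Series(['Famous', 'Novice', 'Excellent', 'Intermediate', 'Novice'])
ratings = toy.astype(rating_dtype)

# Sorting and comparisons now follow the declared order, not the alphabet
print(ratings.sort_values().tolist())
print((ratings >= 'Excellent').sum())
```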

We can ask what the average book rating is in each category. We expect a certain correlation: the best-rated books should be written by the best-rated authors. To compare the average ratings in each category we can do a `.groupby()`. Assign the resulting object to the variable `book_author_ratings`.

The methods you will need: a method that provides the quantity you are looking for (`.mean()`, `.count()`… think about what you would write in SQL!), and a method to sort.

What type of object does `.groupby()` return? If you want a dataframe, you will have to use the `.reset_index()` method, and if you want to define an index, `.set_index()` (search in the doc, or experiment by yourself):

In [7]:
book_author_ratings = df.groupby('author_rating')['book_average_rating']\
                            .mean()\
                            .sort_values(ascending = True)\
                            .reset_index()\
                            .set_index('author_rating') # .reset_index() to get a dataframe,
                                                        # .set_index() to choose a column as index if we want
book_author_ratings

Compare the type of `book_author_ratings` with and without `.reset_index()`:

In [8]:
type(book_author_ratings)
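
The types involved at each step can be checked on a toy dataframe (values invented for illustration): `.groupby()` alone returns a lazy GroupBy object, the aggregation of one column returns a Series, and `.reset_index()` turns that Series into a dataframe.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'author_rating': ['Novice', 'Famous', 'Novice', 'Famous'],
    'book_average_rating': [3.0, 4.5, 3.5, 4.0],
})

grouped = toy.groupby('author_rating')          # GroupBy object, nothing computed yet
means = grouped['book_average_rating'].mean()   # Series indexed by author_rating
means_df = means.reset_index()                  # DataFrame with two regular columns

print(type(grouped).__name__, type(means).__name__, type(means_df).__name__)
```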

Draw your figure (the average book rating by author category):

In [9]:
# author_rating is the index here; reset_index() turns it back into a regular column
sns.barplot(data = book_author_ratings.reset_index(), x='author_rating', y='book_average_rating');

Unsurprisingly, the higher an author's rating, the higher the ratings of their books.

We can try to study what the dispersion is in the different categories. Boxplots can be useful for this (to make the figure look nice, think again about the `order=` argument).

In [11]:
sns.catplot(data = df, 
            x='author_rating', 
            y='book_average_rating', 
            kind='box',  
            order=['Novice', 'Intermediate', 'Excellent', 'Famous']);

We notice a surprising phenomenon: the group with the greatest dispersion is that of intermediate authors. This can be explained by the fact that this group contains many authors, whose production is unsurprisingly uneven (which explains their status). In the more restricted club of famous authors, the dispersion is quite low (the ratings are tightly clustered around the median), but this does not prevent outliers, in particular works that were certainly judged disappointing: being among the best does not rule out failure, and that failure can be quite bitter.

Authors generally do not publish their books alone: another actor takes part, the publisher. Let's repeat the same analyses with publishers. Unfortunately, we do not have a categorization of publishers. Let's first look at how many different publishers there are, how many books each publisher has published, and who these publishers are:

In [13]:
df.publisher.nunique()

The number of publishers is actually quite small. We can therefore draw up the following list:

In [15]:
df.publisher.unique()

We remember that there were strong imbalances between the counts of categorical variables:

In [16]:
df['publisher'].value_counts()\
                .to_frame()  # publishers stay in the index, whatever the pandas version

We can create a dataviz to visualize this. Since the names of the publishers are a bit long, it is better to put them in a legend with a color code. We can also make sure that the publishers are ordered in descending order of the number of works published:

In [None]:
# simple way: use .value_counts() to get an ordered Series (and .reset_index() if you want an ordered dataframe)

 # print(df['publisher'] … your code …)

# complex way (for training) : using .groupby()

#count_ordered_publishers = df.groupby( … your code …

# when categorical labels are long strings, it is better to
# put label in a legend box
fig, ax = plt.subplots()
ax = sns.countplot(data =df, 
                   x='publisher', 
                   hue= 'publisher', 
                   hue_order = count_ordered_publishers, 
                   order=count_ordered_publishers, 
                   ax=ax)
ax.set_xticklabels('')
ax.legend(title='Publisher', 
          labels=count_ordered_publishers,
          loc='upper right', 
          bbox_to_anchor=(1.65, 1.02));
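
The cell above leaves `count_ordered_publishers` for you to define. As one possible sketch, shown on toy data (the publisher names `P1`, `P2`, `P3` are placeholders), both routes should give the same ordering:

```python
import pandas as pd

# Toy data with placeholder publisher names
toy = pd.DataFrame({'publisher': ['P1', 'P2', 'P1', 'P3', 'P1', 'P2']})

# simple way: value_counts() already sorts by decreasing count
by_value_counts = toy['publisher'].value_counts().index.tolist()

# groupby way: count the rows per publisher, then sort explicitly
by_groupby = (toy.groupby('publisher')
                 .size()
                 .sort_values(ascending=False)
                 .index.tolist())

print(by_value_counts, by_groupby)
```

Either list can then be passed to `order=` and `hue_order=`.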

We note the overwhelming superiority of Amazon. We also note that one of the publishers was perhaps distributed among several categories (subsidiaries), quite marginal (n=4).

We can directly plot the boxplots of the evaluations for these 9 publishers. `publisher` is a purely categorical variable, but we can order the results by increasing average ratings of the books (with a `.groupby()`, for practice):

In [34]:
# publishers sorted by increasing mean book rating
rating_ordered_publishers = df.groupby('publisher')['book_average_rating']\
                            .mean()\
                            .sort_values(ascending = True)\
                            .reset_index()['publisher']

fig, ax = plt.subplots()
ax = sns.boxplot(data = df, 
                 hue='publisher', 
                 x='publisher', 
                 y='book_average_rating', 
                 ax=ax, 
                 order=rating_ordered_publishers, 
                 hue_order = rating_ordered_publishers)
ax.set_xticklabels('')
ax.legend(title='Publisher', 
          labels=rating_ordered_publishers,
          loc='upper right', 
          bbox_to_anchor=(1.65, 1.02));

We could improve these dataviz by ensuring that the same color code is respected from one figure to another.
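
One way to do this, sketched under assumptions (toy publisher names, and the hex codes of matplotlib's default "tab10" colour cycle typed in by hand), is to build a fixed publisher-to-colour dictionary once and reuse it in every figure:

```python
import pandas as pd

# Toy data with placeholder publisher names
toy = pd.DataFrame({'publisher': ['P1', 'P2', 'P1', 'P3']})

# Hex codes of matplotlib's default "tab10" colour cycle
tab10 = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
         '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']

# Sort the names so the mapping does not depend on row order or on counts
publishers = sorted(toy['publisher'].dropna().unique())
publisher_palette = dict(zip(publishers, tab10))
print(publisher_palette)
```

Seaborn's `palette=` argument accepts such a dictionary, so passing `palette=publisher_palette` to each call keeps a publisher's colour the same whatever the ordering of the bars or boxes.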

In any case, we can see that the publishers whose books receive the most favorable ratings are publishers with a very low number of publications (n=4), so this is not representative. Hachette Book Group perhaps stands out by being, it seems, noticeably less well rated than its competitors; this remains to be verified with statistical tests (the spread within each publisher is large).

Finally, another variable that can play a role is the language. Once again, there is a strong disparity in the number of books published in each language category.

First, what are these languages?

In [17]:
df.language_code.unique()

Let's quantify and visualize this:

In [18]:
# value_counts() returns a Series sorted by decreasing count, indexed by language code
ordered_languages = df['language_code'].value_counts()
print(ordered_languages)

sns.countplot(data =df, 
                x='language_code',  
                order=ordered_languages.index)

English constitutes the overwhelming majority of published works, so the language will have only a marginal effect on other variables, especially since the number of books in languages other than English is negligible. It makes absolutely no sense to measure the effect of a category containing only one or two books.

We can look at other elements to build a better representation of the data and the field of activity:

- compare the gross sales of different publishers, authors, genres, etc.
- the number of book sales (and not just the number of books published) by the different categories of authors, etc.
- etc. It's up to you!

To get an overview of the interactions between quantitative variables and their distributions, we can use Seaborn's `pairplot()` function:

In [19]:
sns.pairplot(df, height=2);

However, it is not a good idea to systematically run every possible analysis and keep only those that seem to show something. The same goes for crossing quantitative and qualitative variables. Although this does allow us to identify patterns and bring out intuitions, our approach should be justified by hypotheses.

Furthermore, systematically comparing many variables does not necessarily bring simplicity or clarity. Proceeding step by step and deliberately choosing each pair of variables to relate is not wasted effort, even if it takes longer. In addition, a remarkable relationship found this way would look like "cherry picking" (keeping what suits our initial conviction): it is supported by no hypothesis, which is not good practice and leads nowhere in itself.

Nevertheless, it is entirely valid to be surprised by a correlation that we did not suspect, and it may well lead us to fruitful hypotheses. Moreover, if we already have a number of points of interest (we suspect particular relationships between variables), a single command that produces all the visualizations we want to explore or test saves time. It is therefore not an option to rule out, but we cannot really make it the heart of our working method.

Here we can be interested in different relationships. The `book_ratings_count` column seems to show a number of trends:

- `book_ratings_count` and `sales_rank` seem to be strongly correlated (logical! Is it really useful?)
- `book_ratings_count` and `book_average_rating`: a relationship is plausible, but it is interesting to see in which direction it goes
- while we are at it, the relationship between `book_ratings_count` and `gross_sales` (turnover generated by each book),
- especially since `gross_sales` and `book_ratings_count` seem strongly correlated,
- which also leads us to the link with `units_sold`: out of the number of books sold, how many are rated?
- the `units_sold` row of the pairplot seems to show a particular pattern; this variable deserves a closer look. The variable `sale_price` is certainly linked to other variables…
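
Visual impressions such as "seem very correlated" can be backed by numbers. A hedged sketch on toy data (the column names mirror the dataset, the values are invented) computes the Pearson correlation matrix of the numeric columns:

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'book_ratings_count': [100, 200, 300, 400],
    'sales_rank': [4, 3, 2, 1],
    'units_sold': [10, 40, 20, 30],
    'genre': ['a', 'b', 'a', 'b'],
})

# keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes('number').corr()
print(corr.round(2))
```

Keep in mind that Pearson's coefficient only captures linear relationships, so it complements the scatter plots rather than replacing them.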

We can do the same for categorical variables, or for the link between categorical variables and certain quantitative variables.

In this answer key, we will not exhaustively present all these explorations, we will leave it to you to do so, if you had not already done so during the exercise.

For example, let's start with the variable `book_ratings_count`: how is it distributed, and are there outliers? If we intend to use it as an explanatory variable, we should first see how it behaves.

In [20]:
sns.histplot(df['book_ratings_count'], kde = True);

The distribution seems slightly asymmetrical, with a tail towards the high values: some books receive far more ratings than the others (certainly the best-sellers; the relationship is not linear, a few works sweep everything).

In [21]:
df['book_ratings_count'].describe()

There is a ratio of more than 1 to 7 between the book with the fewest ratings (about 27,000) and the book with the most (over 200,000). The mean is a little below 100,000.

In [22]:
sns.boxplot(df['book_ratings_count']);

We see that the exceptional values are works that received many more ratings than the others, as we had started to suspect from the asymmetrical shape of the distribution. Using an IQR function we could isolate them and see in detail what they correspond to (to be done as a bonus question).
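
As a starting point for the bonus question, here is a minimal sketch of such an IQR function. The name `iqr_outliers` and the factor `k=1.5` (the usual boxplot convention) are choices made for this illustration, and the data are invented.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy rating counts, invented for illustration: one value is far above the rest
counts = pd.Series([50, 60, 70, 80, 90, 100, 500])
mask = iqr_outliers(counts)
print(counts[mask].tolist())
```

On the real data, something like `df[iqr_outliers(df['book_ratings_count'])]` would display the corresponding books.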

For the moment, let's focus on the relationship between `book_ratings_count` and `book_average_rating`:

In [23]:
sns.regplot(data =df, x='book_ratings_count', y='book_average_rating');

What do you observe? What do you think? 

« your response here » (edit cell to answer)

Let's see the relationship between gross sales and the number of ratings:

In [49]:
sns.regplot(data =df, x='book_ratings_count', y='gross_sales');

What do you observe? What do you think? 

« your response here » (edit cell to answer)

Let’s now look at the trend between average score and turnover:

In [50]:
sns.regplot(data =df, x='book_average_rating', y='gross_sales');

What do you observe? What do you think? 

« your response here » (edit cell to answer)

Let’s look at the distribution of the number of books sold:

In [24]:
sns.histplot(df['units_sold'], kde = True);

What is remarkable about this distribution?

The distribution is clearly bimodal!

We can therefore try to categorize books between: big sales / small sales for example.
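
A minimal sketch of such a categorization, on invented values. The `threshold` below is hypothetical: on the real data it would be read off the histogram, somewhere in the gap between the two modes.

```python
import numpy as np
import pandas as pd

# Toy sales figures, invented for illustration, with two clearly separated groups
units = pd.Series([100, 150, 120, 5000, 5500, 4800], name='units_sold')

# Hypothetical cut-off placed between the two groups
threshold = 1000

# label each book according to which side of the cut-off it falls on
sales_category = pd.Series(np.where(units >= threshold, 'big', 'small'),
                           name='sales_category')
print(sales_category.value_counts())
```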

There are obviously other variables that can be treated in this way: we can eliminate books published before a certain date, we can group publishers together (based on hypotheses!), etc. with the aim of creating new, simpler (or more complex) variables that synthesize some of our hypotheses or *parti pris*. This is also a step in data preparation.

But before creating models to predict observables from other characteristics (of books), try to isolate the variables that, according to you, should be retained to explain either the number of books sold (`units_sold`), or the turnover they generate (`gross_sales`), and which variables do not belong in the model.

We will check your choices in the next course where we will finally discuss regression models.

In [25]:

# now, it’s up to you !

