## INSIGHTS GATHERED FROM EXPLORATION


#### Job titles

Based on this sample of job titles, we could create the following columns:
    
    * Contract_type (full-time, part-time ...) 
    * Contract_status (CDI, CDD, work-study, internship ...)
    * Duration of Contract (Duration/Undetermined)
    * Experience (Senior, Junior ...)
    * Data Specialization (Supply chain, Marketing, Clinical ...)
    * Multiple titles (Analyst/Scientist, Scientist/ML Engineer, Manager/Analyst ...)
    * Specific expertise asked for (Python, Power BI ...)
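
As a rough illustration of how some of these columns could be derived, here is a minimal sketch using regular expressions. The keyword lists and the `extract_title_features` helper are assumptions made for the example, not a vocabulary taken from the dataset.

```python
import re

# Hypothetical keyword patterns; the real vocabulary should come from the data.
TITLE_PATTERNS = {
    "contract_status": r"\b(cdi|cdd|alternance|stage|internship|freelance)\b",
    "experience": r"\b(senior|junior|confirmé|lead)\b",
    "expertise": r"\b(python|power bi|sql|tableau)\b",
}

def extract_title_features(title):
    """Return the first keyword matched for each feature, or None if absent."""
    lowered = title.lower()
    features = {}
    for name, pattern in TITLE_PATTERNS.items():
        match = re.search(pattern, lowered)
        features[name] = match.group(1) if match else None
    # A slash in the title is used as a crude hint of a multiple-title posting
    features["multiple_titles"] = "/" in title
    return features

example = extract_title_features("Data Analyst / Scientist Senior - CDI - Python")
print(example)
```

Applied to the job title column with `.apply(extract_title_features)`, this would give one dictionary per posting, which can then be expanded into separate columns.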
    
    

#### Company name

When the company name is ***Unspecified***, it can often be found in the description column.

Possible new columns:

    * Group/Holding (Y/N/NC)
    * Interim company (Y/N/NC)

Based on the number of job postings per company, we could potentially infer the **size of the company** or the **amount of data to work on**.

With a time variable and much more data, the number of similar or identical job postings from the same company might give insights into the **company's turnover rate**, the **company's growth**, or **how urgently it needs to hire**.
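
Counting postings per company is the first step for any of these inferences. The sketch below uses a small hypothetical list of company names, since the exact column name is not shown here.

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be the company-name column.
companies = ["Acme", "Acme", "Unspecified", "Globex", "Acme", "Globex"]

# Ignore postings whose company is not given before counting
posting_counts = Counter(c for c in companies if c != "Unspecified")

# Companies with the most postings, most frequent first
top_companies = posting_counts.most_common(2)
print(top_companies)
```

On the DataFrame itself, `value_counts()` on the company column would give the same counts.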

Extracting additional information without more context looks difficult. With access to information about each company's structure, we could create:

    * Size of company
    * Industry
    * Public / Private
    
We'll see whether more of this information can be extracted from the following columns. Otherwise, we could try to scrape **Glassdoor databases** (or similar) to get it. This remains to be seen.

#### Location

We could create a map of the distribution of job postings based on the location provided.

Some companies don't provide a precise location *(e.g. location = FRANCE)* and the information is not available in the description column either. Further investigation will be needed for these companies, perhaps in conjunction with other databases *(GLASSDOOR / SIRENE databases for instance)*.
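
Before mapping, the imprecise locations need to be set aside. A minimal sketch, assuming that a country-level location is simply one whose cleaned value is a country name (the `COUNTRY_LEVEL` set and the sample values are hypothetical):

```python
# Assumption: locations equal to a country name carry no usable detail
COUNTRY_LEVEL = {"france"}

def is_precise_location(location):
    """Return False when the location is only a country name."""
    return location.strip().lower() not in COUNTRY_LEVEL

locations = ["Paris (75)", "FRANCE", "Lyon", " France "]

# Postings that would need a lookup in another database
imprecise = [loc for loc in locations if not is_precise_location(loc)]
print(imprecise)
```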

#### Description


We could create :

    * Length of description (not yet clear what information it could provide)
    * Tools required (Excel, Google Tag Manager ...)
    * Coding languages required (R, Python, SQL ...)
    * Skills required (reporting, data visualization)
    * Required experience
    * Duration of contract
    * Benefits (meal vouchers, "ticket resto")
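
A few of these columns can be sketched with simple keyword matching. The term lists and helper names below are assumptions for illustration. The lookarounds keep short names such as `r` from matching inside other words.

```python
import re

# Hypothetical term lists; they would need to be extended from the data
TOOLS = ["excel", "google tag manager", "power bi"]
LANGUAGES = ["python", "sql", "r"]

def find_terms(text, terms):
    """Return the terms that appear in text as whole words."""
    lowered = text.lower()
    return [
        t for t in terms
        if re.search(r"(?<!\w)" + re.escape(t) + r"(?!\w)", lowered)
    ]

def describe(text):
    """Build a small feature dictionary from one job description."""
    return {
        "length_words": len(text.split()),
        "tools": find_terms(text, TOOLS),
        "languages": find_terms(text, LANGUAGES),
    }

sample = "Maîtrise de Python, SQL et Excel requise. Reporting hebdomadaire."
features = describe(sample)
print(features)
```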


## Visualizing aggregated job descriptions w/ WordCloud

Word clouds provide a quick and visually appealing way to identify the most common terms and phrases used in job descriptions.

Below, I split the dataframe by search query to get one word cloud per query. Grouping by job title might have been more precise, but with over 1300 unique job titles the quick and easy word cloud approach would lose its appeal. So I used search_query as a proxy.

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Group by search query and join each group's descriptions with a space,
# so the last word of one description does not fuse with the first word of the next
agg_descriptions_by_query = (
    data.groupby('search_query')['description_normalized']
    .agg(' '.join)
    .reset_index()
)

for job, agg_text in zip(agg_descriptions_by_query.search_query,
                         agg_descriptions_by_query.description_normalized):

    # Generate the word cloud from the aggregated description of this query
    wordcloud = WordCloud(width=800, height=800, background_color="white").generate(agg_text)
    print("\n***", job, "***\n")

    # Display the word cloud
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()




In [None]:
additionnal_stop_words = ["client", "solution", "service", "donnée", "plus", "busines", "entreprise", "développement", "produit", "team"]
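
The list above is not used yet in this section. One possible use, shown here only as a sketch, is to filter these words out of the normalized text before further analysis. The `remove_stop_words` helper and the sample sentence are hypothetical.

```python
# Same custom list as in the cell above, repeated so the sketch runs on its own
additionnal_stop_words = ["client", "solution", "service", "donnée", "plus",
                          "busines", "entreprise", "développement", "produit", "team"]

def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    stop_set = set(stop_words)
    return " ".join(tok for tok in text.split() if tok not in stop_set)

sample = "analyse donnée client pour équipe produit marketing"
cleaned = remove_stop_words(sample, additionnal_stop_words)
print(cleaned)
```

The `WordCloud` constructor also accepts a `stopwords` set, which would avoid preprocessing the text when the only goal is a cleaner word cloud.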

## Performing N-grams tokenization on normalized job descriptions

With N-gram tokenization, we create tokens that represent not just individual words but also sequences of words that frequently appear together. This matters because the context in which a word is used can greatly affect its meaning.

These tokens may capture more nuanced information about the job description and give more accurate insights into the skills, qualifications, and requirements of the job.
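
To make the idea concrete, here is a small self-contained sketch that builds bigrams from a toy sentence without NLTK. The `make_ngrams` helper and the sample text are made up for the example.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "machine learning engineer with machine learning experience".split()

# Each bigram pairs a word with the word that follows it
example_bigrams = make_ngrams(tokens, 2)
most_common_bigram = Counter(example_bigrams).most_common(1)
print(most_common_bigram)
```

Here "machine learning" is counted as a single unit, which a word-by-word count would miss.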

In [None]:
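import nltk
from nltk import FreqDist
from nltk.util import ngrams

# Note: nltk.word_tokenize needs NLTK's tokenizer models to be downloaded
# beforehand (e.g. nltk.download('punkt'), depending on the NLTK version)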

def get_most_common_ngrams(data, n, k):
    """
    This function takes a list of strings, an integer n (for the n-gram size), 
    and an integer k (for the number of most common n-grams to return). 
    It returns a list of the k most common n-grams in the input data.
    """
    ngrams_list = []  # Create an empty list to store the n-grams
    
    for text in data:
        
        # Tokenize the text into words
        words = nltk.word_tokenize(text)
        # Generate the n-grams and add them to the list
        ngrams_list.extend(list(ngrams(words, n)))
        
    # Count the occurrences of each n-gram using FreqDist
    freq_dist = FreqDist(ngrams_list)
    
    # Get the k most common n-grams and return them
    return freq_dist.most_common(k)

trigrams = get_most_common_ngrams(data.description_normalized, 3, 1000)
bigrams = get_most_common_ngrams(data.description_normalized, 2, 1000)
monograms = get_most_common_ngrams(data.description_normalized, 1, 1000)

    
# Print the most common trigrams, bigrams and monograms
print("The 5 most common trigrams are:\n")
for trigram, count in trigrams[:5]:
    print(' '.join(trigram), count)

print("\nThe 5 most common bigrams are:\n")
for bigram, count in bigrams[:5]:
    print(' '.join(bigram), count)
    
print("\nThe 5 most common monograms are:\n")
for monogram, count in monograms[:5]:
    print(' '.join(monogram), count)