Find the number of times each word appears in drafts

Find the number of times each word appears in drafts.
Output the word along with the corresponding number of occurrences.

In [1]:
import pandas as pd
import numpy as np

In [13]:
google_file_store = pd.read_excel("../CSV/google_file_store.xlsx")
google_file_store.rename(columns={'Tаблица 1': 'filename', 'Unnamed: 1': 'contents'}, inplace=True)
google_file_store.drop(0, inplace=True)
google_file_store

Unnamed: 0,filename,contents
1,draft1.txt,The stock exchange predicts a bull market whic...
2,draft2.txt,The stock exchange predicts a bull market whic...
3,final.txt,The stock exchange predicts a bull market whic...


In [14]:
draft = google_file_store[google_file_store['filename'].str.contains('draft')]
draft

Unnamed: 0,filename,contents
1,draft1.txt,The stock exchange predicts a bull market whic...
2,draft2.txt,The stock exchange predicts a bull market whic...


In [15]:
result = draft.contents.str.split(r'\W+', expand=True).stack().value_counts().reset_index()
result

# Python's regular expression module (re) offers many patterns that can be used to split text into words or other tokens. A few examples:

# '\s+': matches one or more whitespace characters, such as a space, tab, or newline. It is used to split text into words on whitespace.

# '[a-zA-Z]+': matches any run of the letters a to z, in lower or upper case. It is used to extract only words made up of letters.

# '\b\w+\b': matches a whole word made up of letters, digits, or underscores. It is used to extract words while ignoring punctuation.

# '[.,!?;]': matches a single punctuation mark: period, comma, exclamation mark, question mark, or semicolon. It is used to remove punctuation from text.

Unnamed: 0,index,count
0,a,3
1,market,3
2,The,2
3,make,2
4,of,2
5,stock,2
6,,2
7,happy,2
8,many,2
9,investors,2
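
The patterns listed in the comments above can be compared side by side on a short string. This is a minimal sketch using only the standard `re` module; the sample sentence and variable names are hypothetical and are not taken from the dataset.

```python
import re

# Hypothetical sample sentence (not from google_file_store).
text = "The bull market, they say, will make investors happy!"

# Split on runs of whitespace: punctuation stays attached to the words.
by_whitespace = re.split(r'\s+', text)

# Extract runs of ASCII letters only.
letters_only = re.findall(r'[a-zA-Z]+', text)

# Extract whole "word" tokens (letters, digits, underscore).
word_tokens = re.findall(r'\b\w+\b', text)

# Remove the listed punctuation marks.
no_punct = re.sub(r'[.,!?;]', '', text)

# Split on runs of non-word characters, as the notebook cell does.
# The final '!' leaves a trailing empty string in the result.
by_non_word = re.split(r'\W+', text)

print(by_whitespace)
print(letters_only)
print(word_tokens)
print(no_punct)
print(by_non_word)
```

Note that splitting on `\W+` produces an empty string at the end whenever the text ends with punctuation, whereas `re.findall` with a word pattern does not.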


Solution Walkthrough
In this walkthrough, we use the pandas library in Python to find the number of times each word appears in draft files. We read the "google_file_store" spreadsheet into a dataframe and filter it to keep only the files that have the word "draft" in their filename. Then we count the occurrences of each word and output each word along with its number of occurrences.

Understanding The Data
The "google_file_store" is a dataframe that contains information about files in a Google file store. It has columns like "filename" and "contents". We are interested in finding the occurrences of words in the "contents" column for files that have the word "draft" in their filename.
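
To make the layout concrete, here is a small sketch that builds a stand-in dataframe with the same two columns and applies the filename filter. The rows are invented for illustration (the real contents are truncated in the output above), and `regex=False` is an optional choice made here so that 'draft' is matched as a literal substring.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of google_file_store.
files = pd.DataFrame({
    'filename': ['draft1.txt', 'draft2.txt', 'final.txt'],
    'contents': ['a bull market', 'a bear market', 'the final text'],
})

# Boolean mask: True where the filename contains the substring 'draft'.
is_draft = files['filename'].str.contains('draft', regex=False)

# Keep only the draft rows.
drafts = files[is_draft]
print(drafts)
```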

The Problem Statement
We need to find the number of times each word appears in the "contents" column of draft files. We want to output the word along with the corresponding number of occurrences.

Breaking Down The Code
First, we import the pandas library as pd. The numpy library is also imported as np, but the solution does not use it.
We create a new dataframe called "draft" by filtering the "google_file_store" dataframe using the "str.contains" method. We pass the argument 'draft' to check if the "filename" column contains the word "draft".
Next, we perform string splitting and counting operations on the "contents" column of the "draft" dataframe.
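We use the "str.split" method with the regular expression '\W+' to split the contents of each row on runs of non-word characters; "expand=True" places each token in its own column. Text that ends in punctuation leaves a trailing empty string after the split, which would explain the blank entry with a count of 2 in the output above.
The resulting dataframe is converted into a single series using the "stack" method, which produces a multi-index series with one token per row.
We then use the "value_counts" method to count the occurrences of each word; it returns the counts sorted in descending order.
Finally, we reset the index of the resulting series using the "reset_index" method to get a dataframe with two columns: "index" (containing the unique words) and "count" (containing the number of occurrences).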
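
The steps above can be traced one at a time on a tiny input. This is a sketch under stated assumptions: the two strings are hypothetical stand-ins for the draft contents, and the intermediate variable names are introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the "contents" column of the draft rows.
contents = pd.Series(['A bull market.', 'A bear market!'])

# Step 1: split each string on runs of non-word characters,
# one token per column. The final punctuation mark leaves an
# empty string in the last column.
tokens = contents.str.split(r'\W+', expand=True)

# Step 2: stack the columns into one long series of tokens.
stacked = tokens.stack()

# Step 3: count how often each token occurs (sorted descending).
counts = stacked.value_counts()

# Step 4: turn the series into a two-column dataframe.
result = counts.reset_index()

print(tokens)
print(counts)
print(result)
```

The empty-string token shows up in `counts` here for the same reason a blank row appears in the notebook output: each string ends with a punctuation mark.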
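
Bringing It All Together
import pandas as pd
import numpy as np

# Keep only the rows whose filename contains 'draft'.
draft = google_file_store[google_file_store['filename'].str.contains('draft')]
# Split the contents into words, stack them into one series, and count each word.
result = draft.contents.str.split(r'\W+', expand=True).stack().value_counts().reset_index()

Conclusion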
In this walkthrough, we used the pandas library to find the number of times each word appears in draft files. We filtered the data based on the filename, split the contents into words, and counted the occurrences of each word.
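
One possible refinement, which is not part of the original solution, is to extract words with `str.findall` instead of splitting on separators, so that no empty token is produced. The sketch below also lowercases the text so that "The" and "the" are counted together; whether that matches the intended definition of a word is an assumption. The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the "contents" column of the draft rows.
contents = pd.Series(['The market is up.', 'the market is down!'])

# Lowercase, extract every run of word characters as a list per row,
# then explode the lists into one word per row.
words = contents.str.lower().str.findall(r'\w+').explode()

# Count the occurrences of each word.
counts = words.value_counts()
print(counts)
```

Because `findall` returns only the matches, trailing punctuation no longer yields an empty string in the counts.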