In [None]:
import importlib
import nltk
import spacy
import pandas as pd
import re
importlib.reload(nltk)

Here we import the libraries used for text preprocessing. nltk and spacy are natural language processing libraries: in this notebook spacy handles tokenization and lemmatization, and nltk supplies the stopword list. pandas is used for reading and writing spreadsheet data. re provides regular expressions, which are used to clean the text. The call to importlib.reload(nltk) re-imports nltk in the running session; it is not needed on a fresh kernel.

In [None]:
nlp = spacy.load('en_core_web_sm', disable = ['ner', 'parser'])

This line loads spaCy's small English model (en_core_web_sm) and disables the named entity recognition and dependency parser components. Neither component is needed here, because the preprocessing only uses tokens and their lemmas, so disabling them makes processing faster.

In [None]:
df = pd.read_excel('output_past.xlsx')

This line reads the Excel file 'output_past.xlsx' into a Pandas dataframe called df.

In [None]:
stopwords = nltk.corpus.stopwords.words('english')
stopwords_lower = [s.lower() for s in stopwords]

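Here we load the list of English stopwords from nltk. Stopwords are common words (such as "the" or "is") that are usually filtered out of text during preprocessing. The list is also lowercased so that it can be compared against lowercased tokens later. If the stopwords corpus has not been downloaded yet, nltk.download('stopwords') must be run once beforehand.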
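As a small illustration of the filtering step, the sketch below uses a tiny hand-written stopword sample. This is an assumption standing in for the full nltk list, so that the example runs without downloading the corpus. The names sample_stopwords and remove_stopwords are hypothetical and do not appear in the notebook. Storing the stopwords in a set rather than a list makes each membership check fast, which can help on long texts.

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english');
# the real list is much longer.
sample_stopwords = {"the", "is", "a", "of", "and", "in"}


def remove_stopwords(words, stopword_set):
    """Lowercase each word and keep only those not in stopword_set."""
    return [w.lower() for w in words if w.lower() not in stopword_set]


tokens = ["The", "price", "of", "oil", "is", "rising"]
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)  # ['price', 'oil', 'rising']
```

The same idea could be applied in the notebook by wrapping the nltk list in set(...) before filtering.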

In [None]:
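def text_preprocessing(str_input):
    # Tokenize with spaCy and keep the lemma of every non-punctuation token
    words = [token.lemma_ for token in nlp(str_input) if not token.is_punct]

    # Strip every character that is not a letter or "@"
    words = [re.sub(r"[^A-Za-z@]", "", word) for word in words]
    # Remove everything up to and including the last "com" in a token
    # (targets leftover domains such as "examplecom")
    words = [re.sub(r"\S+com", "", word) for word in words]
    # Remove anything that still looks like an email address
    words = [re.sub(r"\S+@\S+", "", word) for word in words]
    # Drop tokens that are a single space or empty after cleaning
    words = [word for word in words if word != ' ']
    words = [word for word in words if len(word) != 0]

    # Lowercase the tokens and drop stopwords
    words = [word.lower() for word in words if word.lower() not in stopwords_lower]

    string = " ".join(words)
    return string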

This block of code defines a function called text_preprocessing that takes a string str_input. The function first tokenizes the input using the nlp object created earlier, discards punctuation tokens, and keeps the lemma (the base form) of each remaining token. It then removes every non-alphabetic character except the "@" symbol, which is kept so that email addresses can still be recognized and removed afterwards. Next, it deletes everything in a token up to and including the last "com" that has at least one character before it. This removes leftover domains such as "examplecom", but it also truncates ordinary words: "welcome" becomes "e", while "computer" is left unchanged. It then deletes any remaining text that looks like an email address, and drops tokens that are a single space or empty. After that, it converts the remaining words to lowercase and removes the stopwords loaded earlier. Finally, the function joins the remaining words into a single string separated by spaces and returns the result.
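To make the effect of the three regular expressions concrete, here is a minimal sketch that applies them to a few sample tokens using only the re module. It skips the spaCy tokenization and lemmatization steps, and the helper name clean_token and the sample tokens are hypothetical.

```python
import re


def clean_token(word):
    """Apply the three substitutions from text_preprocessing to one token."""
    word = re.sub(r"[^A-Za-z@]", "", word)  # keep only letters and "@"
    word = re.sub(r"\S+com", "", word)      # drop everything up to the last "com"
    word = re.sub(r"\S+@\S+", "", word)     # drop remaining email-like text
    return word


# Hypothetical sample tokens
samples = ["Prices2023", "john@example.com", "jane@site.org", "welcome", "computer"]
for token in samples:
    print(repr(token), "->", repr(clean_token(token)))
# 'Prices2023' -> 'Prices'
# 'john@example.com' -> ''
# 'jane@site.org' -> ''
# 'welcome' -> 'e'
# 'computer' -> 'computer'
```

The "welcome" case shows that the "com" pattern is broader than an email or domain filter. If only domains should be removed, a stricter pattern applied before the punctuation is stripped may be a better fit.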

In [None]:
df['text'] = df['text'].fillna('')
df['news_cleaned'] = df['text'].apply(text_preprocessing)

df.to_excel('output_past(preprocessed).xlsx', index=False)
df.head()

These lines fill any missing values in the 'text' column of df with empty strings, then create a new column called 'news_cleaned' that contains the preprocessed text for each row. The apply() method calls the text_preprocessing function on each value of the 'text' column. Finally, the preprocessed data is written to a new Excel file called 'output_past(preprocessed).xlsx', and df.head() displays the first five rows of the dataframe.
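The following sketch shows the fillna and apply pattern on a small in-memory dataframe. The data and the simple_clean function are hypothetical; simple_clean is only a stand-in for text_preprocessing so that the example runs without spaCy or an Excel file.

```python
import pandas as pd

# Hypothetical data: one row has a missing 'text' value
demo = pd.DataFrame({"text": ["Oil  prices rise", None, "Markets fall"]})


def simple_clean(text):
    """Stand-in for text_preprocessing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


# Without fillna, the missing value would reach simple_clean as a
# non-string object and text.lower() would raise AttributeError.
demo["text"] = demo["text"].fillna("")
demo["news_cleaned"] = demo["text"].apply(simple_clean)
print(demo["news_cleaned"].tolist())  # ['oil prices rise', '', 'markets fall']
```

Filling missing cells with empty strings first means the cleaning function only ever receives strings, and rows without text simply produce an empty result.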