# Data processing



This code processes a collection of text files located in a specified directory. It performs the following tasks:


**Data Collection and Preprocessing:**

 - Retrieves a list of text files in a specified directory.
 - Reads the first file in the list as a quick check, then reads every file and joins its non-empty lines into a single string.
 - Extracts the sex label and the year label from each file path.
 - Creates a DataFrame ('ori_df') to store original text, sex labels, and year labels.
 - Displays the first few rows of the DataFrame.

**Text Tokenization and Filtering:**

 - Loads a French tokenizer and stopwords for text processing.
 - Tokenizes the text from the first file using the French tokenizer.
 - Defines a function to segment and remove stopwords from French text.
 - Processes each document by segmenting and removing stopwords.
 - Prints the length of the filtered text.

**TF-IDF Vectorization:**

 - Utilizes the TF-IDF vectorizer to convert the processed text into a numerical format.
 - Prints the shape of the resulting TF-IDF matrix.

**Saving Results:**

 - Saves the TF-IDF matrix as a NumPy array in a file named 'tf-idf.npy'.
 - Writes sex labels to a text file named 'sex.txt' (year labels are written to 'year.txt' at the end of Part 1).

In summary, the code performs text data preprocessing, tokenization, TF-IDF vectorization, and saves the processed data for further analysis or machine learning tasks.

## Part 0. Importing Libraries

In [None]:
# Import necessary libraries
import warnings
import pandas as pd
from time import time
import matplotlib.pyplot as plt
import re, glob
from tqdm import tqdm
import numpy as np
import seaborn as sns
from functools import reduce

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore warnings
warnings.filterwarnings('ignore')

## Part 1. Data Collection and Preprocessing

In [None]:
# Get a list of text files in the specified directory
txt_list = glob.glob("./data_text/*/*.txt")
print(len(txt_list))
print(txt_list[0])

In [None]:
# Read the first file and join its non-empty lines into a single string
with open(txt_list[0], encoding='UTF-8') as f:
    ori_text = [line.strip() for line in f if line.strip()]
ori_text = " ".join(ori_text)
print(ori_text)

In [None]:
# Initialize lists to store sex labels, year labels, and processed text
sex_list = []
year_list = []
ori_text_list = []

# Iterate through each text file
for txt_path in tqdm(txt_list):
    try:
        # Extract the year and sex labels from the file path
        year_label = int(txt_path.split('(')[5][:4])
        sex_label = int(txt_path.split('(')[4][:1])

        # Read the current file and join its non-empty lines
        with open(txt_path, encoding='UTF-8') as f:
            ori_text = " ".join(line.strip() for line in f if line.strip())
    except (IndexError, ValueError, OSError) as err:
        # Report files whose path or content cannot be parsed, then skip them
        print(txt_path, err)
        continue

    # Append only after every step succeeded so the three lists stay aligned
    year_list.append(year_label)
    sex_list.append(sex_label)
    ori_text_list.append(ori_text)

print(len(sex_list), len(year_list), len(ori_text_list))
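
The label extraction above depends on where each `(` falls in the file path. As a minimal sketch of that logic in isolation, the helper below (a hypothetical `parse_labels`, not part of the original notebook) returns both labels together or `None`, so a caller can skip a badly named file without appending to only one list. The example path is made up purely to illustrate the indexing; the real naming scheme of the files in `./data_text/` is not shown in this notebook.

```python
def parse_labels(txt_path):
    """Return (sex_label, year_label) parsed from a path, or None if it does not fit.

    Mirrors the indexing used in the loop above: the 5th '('-separated chunk
    starts with the sex digit, the 6th starts with the four-digit year.
    """
    parts = txt_path.split('(')
    try:
        sex_label = int(parts[4][:1])
        year_label = int(parts[5][:4])
    except (IndexError, ValueError):
        # Too few '(' in the path, or the expected characters are not digits
        return None
    return sex_label, year_label


# Made-up path, used only to show which chunks are read
example_path = "./data_text/a/title(x)(y)(z)(1)(1850).txt"
print(parse_labels(example_path))
print(parse_labels("./data_text/a/no_labels.txt"))
```

Returning a single tuple keeps the sex and year labels paired, which matters because the DataFrame built later requires all three lists to have the same length.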


In [None]:
# Create the DataFrame and display the first few rows
ori_df = pd.DataFrame({'ori_text':ori_text_list, 'sex':sex_list, 'year':year_list})
ori_df.head()

In [None]:
# Write year labels to a text file (sex labels are saved in Part 4)
with open('year.txt', 'w') as f:
    for item in year_list:
        f.write("%s\n" % item)

## Part 2. Text Tokenization and Filtering

In [None]:
# Load the French Punkt tokenizer and the French stopword list
# (both require the NLTK 'punkt' and 'stopwords' resources to be downloaded)
tokenizer_french = nltk.data.load('tokenizers/punkt/french.pickle')
stop_words = set(stopwords.words('french'))

# Tokenize the text of the first document with NLTK's French settings
result = word_tokenize(text=ori_text_list[0], language='french')
print(result)

In [None]:
# Define a function that tokenizes French text and drops stopwords, numbers, and punctuation
def seg_depart(sentence, stopwords):
    # Tokenize the input text with NLTK's French settings
    result = word_tokenize(text=sentence, language='french')

    # Match tokens that are purely digits or purely punctuation
    # (raw string avoids invalid-escape warnings for \d, \w, \s)
    regex = re.compile(r'^\d+$|^[^\w\s]+$')

    # Keep tokens that are neither pure numbers, pure punctuation, nor stopwords
    # (lowercased for the lookup because the NLTK stopword list is lowercase)
    tokens = [token for token in result if not regex.match(token) and token.lower() not in stopwords]
    return " ".join(tokens)
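
To see what the regular expression in `seg_depart` actually removes, the sketch below applies the same pattern to a hand-written token list. It leaves out tokenization and stopword removal so that it runs without any NLTK resources; the helper name `drop_numbers_and_punctuation` and the sample tokens are illustrative only.

```python
import re

# Same pattern as in seg_depart: a token made only of digits,
# or a token made only of non-word, non-space characters
PURE_NUMBER_OR_PUNCT = re.compile(r'^\d+$|^[^\w\s]+$')


def drop_numbers_and_punctuation(tokens):
    """Keep tokens that are not purely numeric and not purely punctuation."""
    return [tok for tok in tokens if not PURE_NUMBER_OR_PUNCT.match(tok)]


# Hand-written tokens standing in for word_tokenize output
sample_tokens = ["En", "1789", ",", "la", "Révolution", "...", "XVIIIe", "«", "3e"]
print(drop_numbers_and_punctuation(sample_tokens))
# ['En', 'la', 'Révolution', 'XVIIIe', '3e']
```

Mixed tokens such as `3e` survive because the pattern only matches when the whole token is digits or the whole token is punctuation.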

In [None]:
# Tokenize and filter every document
filtered_texts = []
for ori_text in tqdm(ori_text_list):
    filtered_text = seg_depart(ori_text, stop_words)
    filtered_texts.append(filtered_text)
print(len(filtered_texts))

## Part 3. TF-IDF Vectorization

In [None]:
# Convert the filtered texts into a TF-IDF matrix, keeping at most the 10,000 most frequent terms
tf_vectorizer = TfidfVectorizer(max_features=10000)
tf_fit = tf_vectorizer.fit_transform(filtered_texts)
print(tf_fit.shape)
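
For intuition about the numbers inside the matrix, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. It is a simplification under stated assumptions: it splits on whitespace rather than using the vectorizer's token pattern, ignores `max_features`, and the function name `tfidf_rows` and the two toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        # Scale each row to unit Euclidean length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Toy documents, not taken from the corpus
vocab, rows = tfidf_rows(["roi reine roi", "reine peuple"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first toy document, `roi` outweighs `reine` both because it occurs twice and because it appears in fewer documents, which is the behaviour the TF-IDF step relies on to highlight distinctive terms.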

## Part 4. Saving Results

In [None]:
# Convert the sparse TF-IDF matrix to a dense array and save it in NumPy format
np.save('tf-idf.npy', tf_fit.toarray())

In [None]:
# Write sex labels to a text file
with open('sex.txt', 'w') as f:
    for item in sex_list:
        # Write to the file
        f.write("%s\n" % item)
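
A later analysis step needs to read these files back. The sketch below shows one way to round-trip a label file written in the one-value-per-line format used above; the helper names `write_labels` and `read_labels` are hypothetical, the labels are made up, and a temporary directory is used so that the real `sex.txt` is not touched.

```python
import os
import tempfile


def write_labels(path, labels):
    """Write one label per line, as in the cell above."""
    with open(path, 'w', encoding='utf-8') as f:
        for item in labels:
            f.write(f"{item}\n")


def read_labels(path):
    """Read labels back as integers, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [int(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    label_path = os.path.join(tmp_dir, "sex.txt")
    write_labels(label_path, [1, 2, 2, 1])  # made-up labels
    restored = read_labels(label_path)

print(restored)
# [1, 2, 2, 1]
```

The matrix itself can be loaded with `np.load('tf-idf.npy')`; its number of rows should equal the number of restored labels.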