# Feature Extraction with Unigrams
After the dataset has gone through preprocessing, the next step is to extract features from it using unigrams, that is, single-token counts per document.
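
As a rough illustration of what unigram extraction means, the sketch below counts single tokens by hand with only the standard library. The two documents are hypothetical toy strings, and whitespace splitting is a simplifying assumption; it is not the tokenizer used later in this notebook.

```python
from collections import Counter

# Hypothetical toy documents (not rows from the dataset)
docs = ["twilight is an alicorn", "twilight twilight pony"]

# A unigram is a single token: split on whitespace and count per document
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all tokens seen in the corpus
vocab = sorted({token for doc in docs for token in doc.split()})

# Each document becomes a vector of counts, one entry per vocabulary term
vectors = [[c[token] for token in vocab] for c in counts]

print(vocab)    # ['alicorn', 'an', 'is', 'pony', 'twilight']
print(vectors)  # [[1, 1, 1, 0, 1], [0, 0, 0, 1, 2]]
```

Each row is one document and each column is one vocabulary term, which is the same layout `CountVectorizer` produces in the steps below.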

## 1. Reading the Dataset

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the preprocessed dataset
file_path = 'processed_tweets.csv'
dataset = pd.read_csv(file_path)

dataset.head()

Unnamed: 0,conversation_id_str,created_at,favorite_count,full_text,id_str,image_url,in_reply_to_screen_name,lang,location,quote_count,reply_count,retweet_count,tweet_url,user_id_str,username
0,349251986438356994,Mon Jun 24 19:45:37 +0000 2013,0,oh hi twilight are you pony ponyville,349251986438356994,,MLP_Alicorn_Twi,en,"Leeds, England",0,2,0,https://x.com/Shelton20_Jack/status/3492519864...,906748764,Shelton20_Jack
1,349044033647157248,Mon Jun 24 06:06:31 +0000 2013,0,great hear twilight filly interesting alicorn,349045855346638849,,NerdyPinkie,en,,0,1,0,https://x.com/Bronywho_/status/349045855346638849,1542576176,Bronywho_
2,348985526667329537,Mon Jun 24 02:08:49 +0000 2013,0,cadance dead twilight alicorn show character d...,348986036539506688,,NEligahn,en,The Grand Ovation Theater,0,0,0,https://x.com/Grimrubix/status/348986036539506688,614900888,Grimrubix
3,348932366217121793,Sun Jun 23 22:35:33 +0000 2013,0,you ok twilight look conserned,348932366217121793,,MLP_Alicorn_Twi,en,"Leeds, England",0,1,0,https://x.com/Shelton20_Jack/status/3489323662...,906748764,Shelton20_Jack
4,348913737111048192,Sun Jun 23 21:21:32 +0000 2013,0,oh hi twilight are you,348913737111048192,,MLP_Alicorn_Twi,en,"Leeds, England",0,1,0,https://x.com/Shelton20_Jack/status/3489137371...,906748764,Shelton20_Jack


## 2. Creating the Unigram Vectorizer

In [2]:
# Initialize CountVectorizer; ngram_range=(1, 1) restricts features to unigrams
vectorizer = CountVectorizer(ngram_range=(1, 1))

# Fit the vocabulary on the tweet text and build the sparse document-term matrix
X = vectorizer.fit_transform(dataset['full_text'])

## 3. Exploring the Results

In [3]:
# Get the list of features (the fitted vocabulary, in sorted order)
features = vectorizer.get_feature_names_out()

# Show the number of features
print(f"Number of features (unigrams): {len(features)}")

# Show the first 10 features
print("Sample features:", features[:10])

Number of features (unigrams): 1230
Sample features: ['__' 'abide' 'able' 'aboard' 'abyss' 'accept' 'accepted' 'according'
 'accurate' 'accustomed']


## 4. Converting to a DataFrame

In [4]:
# Convert the CountVectorizer output to a dense DataFrame, one column per unigram
feature_df = pd.DataFrame(X.toarray(), columns=features)

# Show the first 5 rows
print(feature_df.head())

   __  abide  able  aboard  abyss  accept  accepted  according  accurate  \
0   0      0     0       0      0       0         0          0         0   
1   0      0     0       0      0       0         0          0         0   
2   0      0     0       0      0       0         0          0         0   
3   0      0     0       0      0       0         0          0         0   
4   0      0     0       0      0       0         0          0         0   

   accustomed  ...  yes  yesterday  yet  you  youd  youprincess  youre  yup  \
0           0  ...    0          0    0    1     0            0      0    0   
1           0  ...    0          0    0    0     0            0      0    0   
2           0  ...    0          0    0    0     0            0      0    0   
3           0  ...    0          0    0    1     0            0      0    0   
4           0  ...    0          0    0    1     0            0      0    0   

   zelda  zutheskunk  
0      0           0  
1      0           0  

## 5. Exporting to CSV

In [5]:
# Convert the sparse matrix to a dense DataFrame
# (same as in step 4; repeated so this cell can run on its own)
feature_df = pd.DataFrame(X.toarray(), columns=features)

# Save only the feature DataFrame to CSV
output_file_path = 'unigram_features_only.csv'
feature_df.to_csv(output_file_path, index=False)

print(f"Unigram feature DataFrame saved to: {output_file_path}")

Unigram feature DataFrame saved to: unigram_features_only.csv
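
A quick way to confirm the export is to read the file back and compare it with the original frame. The sketch below does this with a small hypothetical DataFrame and a temporary directory, so it does not touch `unigram_features_only.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for feature_df
toy_df = pd.DataFrame({"hi": [1, 0, 2], "pony": [0, 1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_features.csv")

    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read back
    toy_df.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Column names and values should survive the round trip
same_columns = list(restored.columns) == list(toy_df.columns)
same_values = restored.values.tolist() == toy_df.values.tolist()
print(same_columns and same_values)
```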
