## Understanding the Dataset

**Objective of the Data Analysis**
* The objective depends on the type of project (e.g., classification, regression, clustering). Decide this first, because it determines which column is the target and which checks matter. This dataset pairs tweet text (`clean_text`) with a numeric `category` label, so the checks below treat `category` as the target.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('C:/Users/hp/Desktop/Machine Learning/Datasets/Twitter_Data.csv')

### 1. How big is the dataset?

In [4]:
print(f"Shape (rows, columns): {df.shape}")

Shape (rows, columns): (162980, 2)
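
`df.shape` is a `(rows, columns)` tuple, so the two counts can be unpacked and reported separately. A minimal sketch, using a hypothetical three-row stand-in (`toy`) with the same column names rather than the real file:

```python
import pandas as pd

# Hypothetical stand-in for the tweets DataFrame (same column names, made-up rows)
toy = pd.DataFrame({
    "clean_text": ["good day", "bad day", "ok"],
    "category": [1.0, -1.0, 0.0],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"Rows: {n_rows}, columns: {n_cols}")
```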


### 2. What does the data look like?

In [5]:
df.head()

| | clean_text | category |
|---|---|---|
| 0 | when modi promised “minimum government maximum... | -1.0 |
| 1 | talk all the nonsense and continue all the dra... | 0.0 |
| 2 | what did just say vote for modi  welcome bjp t... | 1.0 |
| 3 | asking his supporters prefix chowkidar their n... | 1.0 |
| 4 | answer who among these the most powerful world... | 1.0 |


In [6]:
df.sample(5)

| | clean_text | category |
|---|---|---|
| 104957 | modi slogan sab saath sab vikash 5yrs\nsab saa... | -1.0 |
| 59134 | its high time change the perseption people tha... | -1.0 |
| 45203 | are very proud our scientists modi after india... | 1.0 |
| 56713 | modi bans tobacco consumption any form will ma... | 1.0 |
| 23325 | east wast modi the best\n | 1.0 |


### 3. What are the data types of the columns?

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162980 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162976 non-null  object 
 1   category    162973 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.5+ MB
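
`category` is stored as `float64` even though the rows shown hold whole-number labels, most likely because the column contains missing values. A minimal sketch, on a hypothetical toy Series rather than the real column, of casting such labels to pandas' nullable `Int64` dtype, which keeps the missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical labels with one missing value, mimicking the 'category' column
labels = pd.Series([1.0, np.nan, -1.0, 0.0], name="category")

# Nullable integer dtype: whole-number floats become integers, NaN becomes <NA>
labels_int = labels.astype("Int64")
print(labels_int.dtype)
print(labels_int.isna().sum())
```

Whether to convert at all is a judgment call; the float representation works for the checks that follow.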


### 4. Are there any missing values?

In [9]:
df.isnull().sum()

clean_text    4
category      7
dtype: int64
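
With only a handful of missing entries, dropping the affected rows is one reasonable option. A minimal sketch on a hypothetical toy frame (`toy`, with made-up rows) that reports the missing share per column and then drops incomplete rows:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one missing text and one missing label
toy = pd.DataFrame({
    "clean_text": ["a b", None, "c d", "e"],
    "category": [1.0, 0.0, np.nan, -1.0],
})

# Fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Drop rows missing either field and renumber the index from 0
cleaned = toy.dropna(subset=["clean_text", "category"]).reset_index(drop=True)
print(len(cleaned))
```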

### 5. What do the summary statistics look like?

In [11]:
df.describe()

| | category |
|---|---|
| count | 162973.0 |
| mean | 0.225436 |
| std | 0.781279 |
| min | -1.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |
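
Because `category` is a label rather than a measurement, per-class counts are often easier to read than a mean or standard deviation. A minimal sketch on a hypothetical toy column (made-up values, not the real distribution):

```python
import numpy as np
import pandas as pd

# Hypothetical labels, including one missing value
toy = pd.DataFrame({"category": [1.0, 1.0, 0.0, -1.0, np.nan]})

# Absolute counts per class, keeping NaN as its own bucket
counts = toy["category"].value_counts(dropna=False)
print(counts)

# Relative frequencies among the non-missing labels
shares = toy["category"].value_counts(normalize=True)
print(shares)
```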


### 6. Are there duplicate values? 

In [12]:
df.duplicated().sum()

1
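
Before removing duplicates it can help to look at them. A minimal sketch on a hypothetical toy frame that lists every copy of a duplicated row and then keeps only the first occurrence:

```python
import pandas as pd

# Hypothetical stand-in with one fully duplicated row
toy = pd.DataFrame({
    "clean_text": ["same", "same", "other"],
    "category": [1.0, 1.0, 0.0],
})

# keep=False marks every copy, so all duplicated rows are shown
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```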

### 7. How are the columns correlated?

In [14]:
# Convert categorical data to one-hot encoding
df_encoded = pd.get_dummies(df, columns=['category'], drop_first=True)

In [16]:
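# Drop rows where 'clean_text' is NaN, then reset the index so the later
# column-wise concatenation lines up row by row
df = df.dropna(subset=['clean_text']).reset_index(drop=True)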

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=1000)

# Transform the 'clean_text' column into a numeric matrix
tfidf_matrix = vectorizer.fit_transform(df['clean_text'])

# Convert the matrix into a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Concatenate with the original DataFrame (without 'clean_text')
df_final = pd.concat([df.drop(columns=['clean_text']), tfidf_df], axis=1)

# Proceed with Correlation
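
`pd.concat(..., axis=1)` matches rows by index label, not by position, so the two frames need identical indexes. A minimal sketch with hypothetical toy frames (`left`, `right`) showing what happens when one index has a gap, as it would after `dropna`, and the other is a fresh `RangeIndex`:

```python
import pandas as pd

# Index with a gap (label 1 is missing), as left behind by dropna
left = pd.DataFrame({"category": [1.0, 0.0, -1.0]}, index=[0, 2, 3])

# Freshly built frame with the default RangeIndex 0..2
right = pd.DataFrame({"feat": [0.1, 0.2, 0.3]})

# Labels are unioned: 4 rows, with NaN wherever one side lacks a label
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# Resetting the index first pairs the rows positionally
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned)
```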

In [18]:
numeric_df = df_final.select_dtypes(include=[float, int])

# Calculate correlation
correlation_matrix = numeric_df.corr()
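
With 1000 TF-IDF features the full matrix is mostly term-to-term correlations. If the interest is only in how each term relates to `category`, `DataFrame.corrwith` gives that single column directly. A minimal sketch on a hypothetical toy frame (the `good` and `bad` columns and their values are made up):

```python
import pandas as pd

# Hypothetical stand-in: one label column plus two term-weight columns
toy = pd.DataFrame({
    "category": [1.0, 0.0, -1.0, 1.0],
    "good": [0.9, 0.1, 0.0, 0.8],
    "bad": [0.0, 0.2, 0.9, 0.1],
})

# Correlation of every feature column with the label, strongest positive first
target_corr = (
    toy.drop(columns=["category"])
    .corrwith(toy["category"])
    .sort_values(ascending=False)
)
print(target_corr)
```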

In [19]:
print(correlation_matrix)

          category       100       1st      2012      2014      2019   
category  1.000000 -0.004674  0.001939  0.006868 -0.004343 -0.002363  \
100      -0.004674  1.000000  0.001954 -0.003381  0.012365  0.010684   
1st       0.001939  0.001954  1.000000 -0.002001  0.005421  0.010986   
2012      0.006868 -0.003381 -0.002001  1.000000  0.010100  0.008303   
2014     -0.004343  0.012365  0.005421  0.010100  1.000000  0.065055   
...            ...       ...       ...       ...       ...       ...   
your     -0.002785 -0.007234 -0.004126 -0.004957 -0.008690 -0.014239   
youre    -0.001452 -0.001856 -0.001384 -0.003129 -0.002977 -0.003136   
yourself  0.002690 -0.002797 -0.001423 -0.000098 -0.001209  0.000490   
youth     0.002492 -0.001552 -0.000566 -0.004067  0.006514 -0.002870   
yrs       0.003134  0.002992  0.005416 -0.000175  0.009692 -0.004451   

               4th      6000     72000       aap  ...  yesterday       yet   
category  0.002198 -0.001437  0.000138  0.001259  ...   0