# Basic Text Features

In [1]:
# sample text string
text = "Dark matter is one of the greatest enigmas of astrophysics and cosmology"

We will split the string into individual words, or tokens. This process is known as __tokenization__.

In [3]:
# split words of the text
text.split()

['Dark',
 'matter',
 'is',
 'one',
 'of',
 'the',
 'greatest',
 'enigmas',
 'of',
 'astrophysics',
 'and',
 'cosmology']

In [4]:
# store the individual words in a variable
words = text.split()
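Whitespace splitting is the simplest form of tokenization, and it leaves punctuation attached to the neighbouring word. The sketch below illustrates this on a made-up sentence and contrasts it with a regular-expression tokenizer. The sentence and the regex pattern are assumptions for illustration only, not part of the original notebook.

```python
import re

# Made-up sentence with punctuation (not from the dataset)
demo_text = "Dark matter is one of the greatest enigmas, isn't it?"

# str.split() with no arguments splits on any run of whitespace,
# so punctuation stays glued to the word next to it
whitespace_tokens = demo_text.split()

# Illustrative alternative: keep only runs of letters, digits and apostrophes
regex_tokens = re.findall(r"[A-Za-z0-9']+", demo_text)

print(whitespace_tokens)
print(regex_tokens)
```

Which tokenizer is appropriate depends on whether punctuation should count towards word length in the features built later.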

### 1. Number of Words

In [5]:
# word count
len(words)

12

### 2. Number of Spaces

In [6]:
# spaces count
text.count(' ')

11
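Here the space count is exactly one less than the word count because the words are separated by single spaces. That relationship need not hold for messier text. The following sketch, using a made-up string, shows how repeated spaces and tabs change the numbers.

```python
# Made-up string with leading/trailing spaces, a double space and a tab
messy = "  Dark  matter is\tone "

messy_words = messy.split()                          # splits on any whitespace run
messy_spaces = messy.count(" ")                      # counts literal space characters only
messy_whitespace = sum(ch.isspace() for ch in messy) # counts spaces, tabs, newlines, ...

print(len(messy_words), messy_spaces, messy_whitespace)
```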

### 3. Number of Characters

In [7]:
# character count
len(text)

72

Note that this count includes the 11 spaces.

In [8]:
# character count (excluding spaces)
len(text)-text.count(' ')

61

So, the text string has 61 characters excluding spaces.

### 4. Average Word Length

In [9]:
# empty list to hold the length of each word
word_lengths = []

for i in text.split():
    word_lengths.append(len(i))
    
print(word_lengths)

[4, 6, 2, 3, 2, 3, 8, 7, 2, 12, 3, 9]


In [10]:
# average word length
sum(word_lengths)/len(word_lengths)

5.083333333333333
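The same calculation can be wrapped in a small helper. This is a minimal sketch; the function name `average_word_length` and the choice to return 0.0 for text with no words are assumptions, not something the notebook defines.

```python
def average_word_length(s: str) -> float:
    """Mean length of whitespace-separated tokens; 0.0 if there are none."""
    tokens = s.split()
    if not tokens:
        # avoid dividing by zero on empty or whitespace-only text
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


sample = "Dark matter is one of the greatest enigmas of astrophysics and cosmology"
print(average_word_length(sample))
print(average_word_length("   "))
```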

---

# Create Features for Twitter Dataset

Let's create the above-mentioned features for a real-life dataset.

In [11]:
import pandas as pd

In [12]:
tweets = pd.read_csv("Dataset/tweets.csv")

Take a quick look at the data.

In [13]:
tweets.head()

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...
1,2,0,Finally a transparant silicon case ^^ Thanks t...
2,3,0,We love this! Would you go? #talk #makememorie...
3,4,0,I'm wired I know I'm George I was made that wa...
4,5,1,What amazing service! Apple won't even talk to...


This dataset has 3 columns right now.

1. __id:__ tweet id number, unique for every tweet
2. __label:__ 1 for a negative tweet and 0 for a positive or neutral tweet
3. __tweet:__ the text of the tweet

We will create new features from the "tweet" column.


### 1. Word Count Feature

In [14]:
# number of words/terms in the tweets
tweets['word_count'] = [len(i.split()) for i in tweets['tweet']]

In [15]:
tweets.head()

Unnamed: 0,id,label,tweet,word_count
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...,13
1,2,0,Finally a transparant silicon case ^^ Thanks t...,17
2,3,0,We love this! Would you go? #talk #makememorie...,15
3,4,0,I'm wired I know I'm George I was made that wa...,17
4,5,1,What amazing service! Apple won't even talk to...,23


As you can see, we have a new feature, __word_count__. Now let's create a feature for the number of spaces in each tweet.

### 2. Space Count Feature

In [16]:
# number of spaces in each tweet
tweets['space_count'] = [i.count(' ') for i in tweets['tweet']]

In [17]:
tweets.head()

Unnamed: 0,id,label,tweet,word_count,space_count
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...,13,12
1,2,0,Finally a transparant silicon case ^^ Thanks t...,17,16
2,3,0,We love this! Would you go? #talk #makememorie...,15,14
3,4,0,I'm wired I know I'm George I was made that wa...,17,16
4,5,1,What amazing service! Apple won't even talk to...,23,22


### 3. Character Count Feature

In [18]:
# number of characters in each tweet, excluding spaces
tweets['character_count'] = [len(i) - i.count(' ') for i in tweets['tweet']]

In [19]:
tweets.head()

Unnamed: 0,id,label,tweet,word_count,space_count,character_count
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...,13,12,116
1,2,0,Finally a transparant silicon case ^^ Thanks t...,17,16,115
2,3,0,We love this! Would you go? #talk #makememorie...,15,14,109
3,4,0,I'm wired I know I'm George I was made that wa...,17,16,96
4,5,1,What amazing service! Apple won't even talk to...,23,22,102


### 4. Average Word Length Feature

In [20]:
avg_word_length = []

# outer loop over tweets, inner loop over the words of each tweet
for tweet in tweets['tweet']:
    word_lengths = []
    for word in tweet.split():
        # length of each term in the tweet
        word_lengths.append(len(word))
    
    # average word length of the tweet (0 if the tweet has no words)
    avg = sum(word_lengths)/len(word_lengths) if word_lengths else 0
    
    avg_word_length.append(avg)

In [21]:
# create new feature 
tweets['average_word_length'] = avg_word_length
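The four features can also be computed with pandas string methods instead of explicit Python loops. The sketch below uses a tiny made-up DataFrame named `demo` as a stand-in for `tweets`; the rows are hypothetical and only serve to make the example self-contained.

```python
import pandas as pd

# Tiny made-up frame standing in for the tweets data (hypothetical rows)
demo = pd.DataFrame({"tweet": ["Dark matter is one", "We love this! #talk", "ok"]})

# list of whitespace-separated tokens for every row
tokens = demo["tweet"].str.split()

demo["word_count"] = tokens.apply(len)
demo["space_count"] = demo["tweet"].str.count(" ")
demo["character_count"] = demo["tweet"].str.len() - demo["space_count"]
demo["average_word_length"] = tokens.apply(
    lambda words: sum(len(w) for w in words) / len(words) if words else 0.0
)

print(demo)
```

The column names match the ones used above, so the same expressions should carry over to `tweets` as long as the "tweet" column holds strings.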

# Build Model

In [22]:
tweets.head()

Unnamed: 0,id,label,tweet,word_count,space_count,character_count,average_word_length
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...,13,12,116,8.923077
1,2,0,Finally a transparant silicon case ^^ Thanks t...,17,16,115,6.764706
2,3,0,We love this! Would you go? #talk #makememorie...,15,14,109,7.266667
3,4,0,I'm wired I know I'm George I was made that wa...,17,16,96,5.647059
4,5,1,What amazing service! Apple won't even talk to...,23,22,102,4.434783


In [23]:
# feature matrix (the four engineered features) and target
X = tweets[['word_count', 'space_count', 'character_count', 'average_word_length']]
y = tweets['label']

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler # for standardization

In [25]:
# standardize the features, then split the dataset into train and test sets
xtrain, xtest, ytrain, ytest = train_test_split(StandardScaler().fit_transform(X), y, 
                                                test_size=0.33, random_state=42)

In [26]:
xtrain.shape, xtest.shape

((5306, 4), (2614, 4))
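In the cell above, the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only, for example with a scikit-learn pipeline. The sketch below shows that variant on synthetic data; the arrays `X_demo` and `y_demo` are made up and do not come from the tweets dataset, so the printed score says nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (made-up data, not the tweets dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Split the raw, unscaled features first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# ... then let the pipeline fit the scaler on the training rows only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

# The fitted scaler is reused, unchanged, when scoring the test rows
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```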

In [27]:
# fit model
lr = LogisticRegression()
lr.fit(xtrain, ytrain)

In [28]:
# predicted class probabilities for the test set
preds = lr.predict_proba(xtest)

In [29]:
preds

array([[0.92294669, 0.07705331],
       [0.59967747, 0.40032253],
       [0.9516382 , 0.0483618 ],
       ...,
       [0.22800467, 0.77199533],
       [0.57410116, 0.42589884],
       [0.85136928, 0.14863072]])

In [30]:
# ROC AUC using the predicted probability of class 1 (negative tweet)
roc_auc_score(ytest, preds[:,1])

0.8634997421167766