## Theodore Roosevelt, “Duties of American Citizenship”

In [1]:
paragraph = """Of course, in one sense, the first essential for a man’s being a good citizen is his
               possession of the home virtues of which we think when we call a man by the emphatic 
               adjective of manly. No man can be a good citizen who is not a good husband and a good
               father, who is not honest in his dealings with other men and women, faithful to his 
               friends and fearless in the presence of his foes, who has not got a sound heart, a
               sound mind, and a sound body; exactly as no amount of attention to civil duties will
               save a nation if the domestic life is undermined, or there is lack of the rude military
               virtues which alone can assure a country’s position in the world. In a free republic the
               ideal citizen must be one willing and able to take arms for the defense of the flag, exactly
               as the ideal citizen must be the father of many healthy children. A race must be strong and
               vigorous; it must be a race of good fighters and good breeders, else its wisdom will come to
               naught and its virtue be ineffective; and no sweetness and delicacy, no love for and 
               appreciation of beauty in art or literature, no capacity for building up material prosperity
               can possibly atone for the lack of the great virile virtues.But this is aside from my subject
               , for what I wish to talk of is the attitude of the American citizen in civic life. It ought
               to be axiomatic in this country that every man must devote a reasonable share of his time to
               doing his duty in the Political life of the community. No man has a right to shirk his political
               duties under whatever plea of pleasure or business; and while such shirking may be pardoned in those
               of small means it is entirely unpardonable in those among whom it is most common–in the people whose
               circumstances give them freedom in the struggle for life. In so far as the community grows to think rightly,
               it will likewise grow to regard the young man of means who shirks his duty to the State in time of peace as
               being only one degree worse than the man who thus shirks it in time of war. A great many of our men in
               business, or of our young men who are bent on enjoying life (as they have a perfect right to do if only 
               they do not sacrifice other things to enjoyment), rather plume themselves upon being good citizens if
               they even vote; yet voting is the very least of their duties, Nothing worth gaining is ever gained
               without effort. You can no more have freedom without striving and suffering for it than you can win success
               as a banker or a lawyer without labor and effort, without self-denial in youth and the display of a ready
               and alert intelligence in middle age. The people who say that they have not time to attend to politics are
               simply saying that they are unfit to live in a free community."""

In [2]:
import nltk
import re
import pandas as pd
import numpy

In [3]:
from collections import Counter

In [4]:
# checking most frequent words 
#para = paragraph
#para = para.split()
#counter = Counter(para)
#most_freq = counter.most_common(20)


In [5]:
#type(para)

In [32]:
#most_freq

In [7]:
# tokenize into sentences and words
# lemmatize
# remove stopwords
# find the most frequent words


sentences = nltk.sent_tokenize(paragraph) # tokenizing the paragraph into sentences 
#words = nltk.word_tokenize(paragraph)  # tokenizing the paragraph into words


In [8]:
len(sentences)

10

In [9]:
from nltk.stem import WordNetLemmatizer # reduces words to their dictionary form (lemma)
from nltk.corpus import stopwords       # to remove stopwords (i.e. very common words that carry little meaning)
lemmatizer = WordNetLemmatizer()


In [10]:
corpus = []

In [11]:
stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per word
for sentence in sentences:
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)  # replace everything except letters with a space
    sentence = sentence.lower()
    words = sentence.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))
    
    

In [12]:
corpus

['course one sense first essential man good citizen possession home virtue think call man emphatic adjective manly',
 'man good citizen good husband good father honest dealing men woman faithful friend fearless presence foe got sound heart sound mind sound body exactly amount attention civil duty save nation domestic life undermined lack rude military virtue alone assure country position world',
 'free republic ideal citizen must one willing able take arm defense flag exactly ideal citizen must father many healthy child',
 'race must strong vigorous must race good fighter good breeder else wisdom come naught virtue ineffective sweetness delicacy love appreciation beauty art literature capacity building material prosperity possibly atone lack great virile virtue aside subject wish talk attitude american citizen civic life',
 'ought axiomatic country every man must devote reasonable share time duty political life community',
 'man right shirk political duty whatever plea pleasure busines

In [13]:
#len(corpus)

In [14]:
#type(sentence)
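
The "most frequent words" step in the plan above only appears as commented-out code that runs on the raw paragraph. A minimal sketch of the same idea on cleaned sentences might look like this; `sample_corpus` is an illustrative stand-in holding two of the cleaned sentences from the `corpus` output, not the full list.

```python
from collections import Counter

# Two cleaned sentences copied from the `corpus` output (stand-in for the full list).
sample_corpus = [
    "free republic ideal citizen must one willing able take arm defense flag "
    "exactly ideal citizen must father many healthy child",
    "ought axiomatic country every man must devote reasonable share time duty "
    "political life community",
]

# Flatten the sentences into one list of tokens and count each token.
tokens = [word for sentence in sample_corpus for word in sentence.split()]
word_counts = Counter(tokens)

# most_common(n) returns the n (word, count) pairs with the highest counts.
top_words = word_counts.most_common(3)
print(top_words)
```

Counting after stopword removal and lemmatization keeps words such as "the" and "of" from dominating the list, which is what would happen on the raw text.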

In [17]:
# converting text into sparse matrix using bag of words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
mat_cv = cv.fit_transform(corpus)

In [24]:
mat_cv.toarray()

array([[0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
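
To make the rows and columns of this matrix easier to read, here is a rough hand-rolled sketch of the bag-of-words idea in plain Python. The three-sentence `docs` list is hypothetical, and the code only approximates `CountVectorizer` (which also lowercases and drops single-character tokens by default): one column per distinct word in alphabetical order, one row per sentence.

```python
# Hypothetical mini-corpus, already cleaned like the sentences above.
docs = ["good citizen good father", "citizen duty", "duty political life"]

# Vocabulary: every distinct word, sorted alphabetically (one column per word).
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per sentence; each cell counts how often that word occurs in it.
bow = [[doc.split().count(term) for term in vocab] for doc in docs]

print(vocab)
for row in bow:
    print(row)
```

Most cells are zero because each sentence uses only a small part of the vocabulary, which is why `fit_transform` returns a sparse matrix that has to be converted with `.toarray()` to be displayed like this.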

In [25]:
# converting text into numeric features using tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
mat_tfidf = tfidf.fit_transform(corpus)

In [28]:
mat_tfidf.toarray()

array([[0.        , 0.26698237, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.22323441, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.18750047, 0.15939253,
        0.        ],
       [0.        , 0.        , 0.20434048, ..., 0.        , 0.        ,
        0.20434048],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

### TfidfVectorizer is often more informative than CountVectorizer, because the latter gives no importance weighting to the words (i.e. if two words occur the same number of times in the count matrix, nothing tells us which of them matters more; TF-IDF addresses this by down-weighting words that appear in many sentences and up-weighting words that appear in only a few).
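
The weighting can be checked by hand on a tiny example. The sketch below is plain Python on a hypothetical three-sentence `docs` list, and it assumes the default `TfidfVectorizer` settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows). In the second sentence, "citizen" and "father" both occur once, but "citizen" appears in every sentence and "father" in only one.

```python
import math

# Hypothetical mini-corpus; "citizen" occurs in all sentences, "father" in one.
docs = ["citizen duty life", "citizen father", "citizen political duty"]
vocab = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)

# Document frequency: in how many sentences does each word occur?
df = {t: sum(t in doc.split() for doc in docs) for t in vocab}

# Smoothed inverse document frequency (assumed default: smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalised tf-idf weights of one sentence, in vocab order."""
    words = doc.split()
    raw = [words.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

weights = dict(zip(vocab, tfidf_row(docs[1])))
print(weights["citizen"], weights["father"])
```

A count matrix would store 1 for both words in that row, while the tf-idf row gives "father" the larger weight because it is the rarer word across the corpus.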