# Feature engineering :
- For ML, data must be in tabular form and numerical
- for categorical : One hot encoding
      pd.get_dummies(df, columns=[ 'name' ])
  - not mentioning columns : automatically encodes all non-numeric features
- for text :
  - pre-processing (reduction -> reduce, converting to the base form)
  - Vectorization
  - Extract basic word features (length, hashtags, etc)
  - POS tagging : part of speech tagging , I -> pronoun
  - Named Entity recognition

## Basic Feature Extraction :
- Number of characters, including whitespaces : len(text)
- applying function :
      df['new_col'] = df['old_col'].apply(function)
- number of words : len(text.split())
- write a helper function for each feature and pass it to apply
- Average word length : total characters in words / number of words
- Special features, eg. hashtags : word.startswith('#')
- other basic features, number of sentences, numbers, etc.
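
The features listed above can be sketched in plain Python before wiring them into `df.apply`. This is a minimal illustration; the function name `basic_features` and the sample sentence are assumptions, not part of the course material.

```python
def basic_features(text):
    """Return simple count-based features for one document."""
    words = text.split()  # split on any whitespace
    n_words = len(words)
    return {
        "num_chars": len(text),  # includes whitespace
        "num_words": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "num_hashtags": sum(1 for w in words if w.startswith("#")),
    }


features = basic_features("I love #nlp and #python")
print(features)
```

With pandas, the same helper could be applied per row, e.g. `df['text'].apply(basic_features)`, assuming a `text` column exists.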

## Readability tests :
- determine the level of English needed to comprehend a text
- scale ranges from primary school to college graduate level
- mathematical formula using the word, syllable and sentence count (used in fake news, spam detection)
- eg: Flesch reading ease, Gunning fog index, Simple Measure of Gobbledygook (SMOG), Dale-Chall score
- Flesch :
  - oldest, most widely used
  - based on average sentence length
  - based on average number of syllables per word
  - higher score -> easier to understand
- Gunning:
  - average sentence length
  - uses % of complex words (3 or more syllables)
  - higher -> more difficult
- using python
      from textatistic import Textatistic
      scores = Textatistic(text).scores
  - scores is a dict (eg. scores['flesch_score'], scores['gunningfog_score'])
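
To make the two scores concrete, here is a small sketch of the commonly quoted Flesch reading ease and Gunning fog formulas. The counts are passed in directly (counting syllables reliably needs a library such as textatistic); the function names and the sample counts are illustrative assumptions.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Higher score -> easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def gunning_fog(n_words, n_sentences, n_complex_words):
    """Higher score -> harder text; complex words have 3 or more syllables."""
    return 0.4 * ((n_words / n_sentences) + 100 * (n_complex_words / n_words))


# Hypothetical document: 100 words, 5 sentences, 150 syllables, 10 complex words
fre = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
fog = gunning_fog(n_words=100, n_sentences=5, n_complex_words=10)
print(round(fre, 3), round(fog, 3))
```

Longer sentences lower the Flesch score and raise the fog index, which matches the "higher -> easier" and "higher -> more difficult" notes above.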

## Tokenization and Lemmatization
- Text comes from a variety of sources
- standardizing is important
- Text preprocessing:
  - Tokenization : splitting a string into tokens based on the language, eg. words, sentences, punctuation
  - expanding contracted words
  - Library :
         import spacy
         model = spacy.load('en_core_web_sm')
         doc = model(string)
         tokens = [token.text for token in doc]

  - Lemmatization : converting words into their base form
  - am,are,is -> be, n't -> not, etc.
  - spaCy does this automatically when the text is passed through the model
        lemma = [token.lemma_ for token in doc]
  - pronouns (I, you, etc.) are converted to '-PRON-' in spaCy v2; spaCy v3 returns the pronoun itself

  - NLTK is good for sentence tokenization, spacy for words

## Text cleaning:
- str.isalpha(), str.isnumeric(), or regex to filter tokens
- stopwords to ignore : spacy.lang.en.stop_words.STOP_WORDS

## Part of speech tagging :
- word sense disambiguation (bear/bear)
- noun vs verb
- sentiment analysis / question answering / fake news
- assigning each word/token its part of speech
- using spacy :
      pos = [(token.text, token.pos_) for token in doc]
  - accuracy depends on the model used and the text it is applied to
  - around 20 part-of-speech tags

## Named entity recognition :
- search algorithms, question answering, news article classification
- identifying and classifying
- spacy:
      ne = [(ent.text,ent.label_) for ent in doc.ents]
   (15 different types of named entity)

## Vectorization : creating a big bag of words:
- data: tabular + numerical
- converting text documents into numerical vectors
- extract word tokens, compute their frequencies,
construct a word vector using the frequencies and the vocabulary.

- Using Sklearn :
- create a text corpus using pd.Series([...])
  - CountVectorizer :
        from sklearn.feature_extraction.text import CountVectorizer
        vectorizer = CountVectorizer()
  - generate the 2D bow matrix (one row per document):
        bow_matrix = vectorizer.fit_transform(corpus)
        bow_matrix.toarray()
  - Convert bow_matrix into a DataFrame :
        bow_df = pd.DataFrame(bow_matrix.toarray())
    - Map the column names to vocabulary :
        # older scikit-learn releases : get_feature_names()
        bow_df.columns = vectorizer.get_feature_names_out()
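
The three steps (extract tokens, count them, build vectors over the vocabulary) can be sketched without scikit-learn. This is an illustrative version only: the toy corpus is made up, and tokenization is a plain lowercase whitespace split rather than CountVectorizer's own token pattern.

```python
from collections import Counter

# Toy corpus (hypothetical)
corpus = ["the lion is the king", "the lion roars"]

# 1. extract word tokens
tokenized = [doc.lower().split() for doc in corpus]

# 2. build the vocabulary: one dimension per unique word
vocab = sorted({word for doc in tokenized for word in doc})

# 3. one count vector per document, aligned with the vocabulary
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[word] for word in vocab])

print(vocab)
print(bow)
```

Each row has one entry per vocabulary word, so the matrix grows wider with every new unique word in the corpus.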

## BoW Naive Bayes Classifier :
- Spam Filtering (Spam Vs Ham)
  - Using CountVectorizer arguments :
      lowercase (True / False), strip_accents ('unicode', 'ascii', None)
      stop_words ('english', list, None)
      token_pattern (regex) / tokenizer (function)
    - No lemmatization is performed by CountVectorizer
    - Main job : convert the corpus into a matrix of numerical vectors
    - if words in test are not in train, CountVectorizer ignores those words
- Training :
      from sklearn.naive_bayes import MultinomialNB
      clf = MultinomialNB()
      clf.fit(x_train_bow, y_train)
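
To show what a multinomial Naive Bayes classifier does with bag-of-words counts, here is a small from-scratch sketch with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the function names and the tiny spam/ham data are assumptions for illustration. Words unseen in training are skipped at prediction time, mirroring the note about CountVectorizer above.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(log_likelihood[w] for w in doc if w in log_likelihood)
    return max(scores, key=scores.get)


train_docs = [["win", "money", "now"], ["free", "money"],
              ["meeting", "at", "noon"], ["lunch", "at", "noon"]]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
pred_spam = predict_nb(model, ["free", "money", "now", "please"])  # 'please' is unseen
pred_ham = predict_nb(model, ["meeting", "at", "lunch"])
print(pred_spam, pred_ham)
```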

Project : Spam Filter

## N-gram Models :

- BoW representational problem : context is lost
- n-gram : contiguous sequence of n elements (BoW : n=1)
- n=2, (a boy, boy lost, lost his), etc.
- captures more context (eg. negation : "not good")
- sentence completion, spelling correction, etc. It basically computes the probability of the words occurring together
- CountVectorizer :
      bigrams = CountVectorizer(ngram_range=(2,2))
  - ngram_range=(1,3) generates unigrams, bigrams and trigrams
- higher n increases the number of features (curse of dimensionality)
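
A minimal sketch of how contiguous n-grams are produced from a token list. The helper name `ngrams` and the sample sentence are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "a boy lost his ball".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, and the number of distinct n-grams across a corpus grows quickly with n, which is where the dimensionality problem comes from.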

## Building tf-idf document vectors :
(Term Frequency inverse document frequency)
- commonly occurring words increase dimensions without adding information
- Exclusivity, assigning more **weight** to words that characterize a document
- example : jupiter vs universe, characterizing a document using that word
- Applications : automatically detect stop words, search, recommender systems
- tf-idf : the weight of a term in a document should be proportional to its frequency in that document and inversely proportional to the number of documents it occurs in.
   - **Formula** weight = word freq in doc * log(number of docs/number of docs containing word)
   - more weight, more characteristic
   - scikit-learn :
          from sklearn.feature_extraction.text import TfidfVectorizer
          vect = TfidfVectorizer()
          tfidf_matrix = vect.fit_transform(corpus)
          tfidf_matrix.toarray()
    - weights are non-integer
- magnitude of a tf-idf vector produced by TfidfVectorizer is always 1 (rows are L2-normalized by default).
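
The formula above can be applied by hand as follows. This sketch uses the plain textbook weighting from these notes; TfidfVectorizer additionally smooths the idf term and L2-normalizes each row by default, so its numbers will differ. The toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf(corpus_tokens):
    """weight = term frequency in doc * log(n_docs / n_docs containing the term)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(w for doc in corpus_tokens for w in set(doc))
    return [
        {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in Counter(doc).items()}
        for doc in corpus_tokens
    ]


# Hypothetical corpus: 'universe' appears everywhere, 'jupiter' in one document only
docs = [
    ["jupiter", "planet", "universe"],
    ["universe", "galaxy"],
    ["universe", "star", "star"],
]
weights = tfidf(docs)
print(weights[0])
```

A word that occurs in every document gets weight 0, while a word exclusive to one document gets the largest weight, which is the "jupiter vs universe" point.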

## Cosine similarity :

- Similarity between 2 vectors, i.e. documents
- Cosine similarity score
- cos(theta) = (A . B) / (|A| * |B|)
- in NLP, cosine values lie between 0 and 1 (as document vectors use non-negative weights)
- Cosine score ignores the magnitude of the vectors, therefore it is robust to document length
- Scikit learn :
  - takes only 2D arrays as arguments
        from sklearn.metrics.pairwise import cosine_similarity
        score = cosine_similarity([a], [b])
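
The cosine formula can be checked by hand with a few lines of plain Python. The helper name `cosine_sim` and the vectors are assumptions for illustration; it expects non-zero vectors.

```python
import math


def cosine_sim(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1, 2, 0]
b = [2, 4, 0]   # same direction as a, twice the length
c = [0, 0, 3]   # shares no non-zero dimension with a

print(cosine_sim(a, b), cosine_sim(a, c))
```

Because `b` is just a scaled copy of `a`, the score is 1 (up to floating point), which is the sense in which the cosine score ignores magnitude and document length.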

## Building a plot line based recommender :

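- Suggests movies based on overviews
- recommender function :
  - takes a movie title and returns the titles with the highest similarity scores
  - ignores the highest similarity score of 1 (the movie itself)
- Since the magnitude of a tf-idf vector is always 1, the cosine score will always be equal to the dot product
- linear_kernel function computes the dot product (same result) :
      from sklearn.metrics.pairwise import linear_kernel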
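
A sketch of what such a recommender function might look like once a similarity matrix is available. The titles and scores below are made up, and `get_recommendations` is a hypothetical name; in practice the matrix would come from `cosine_similarity` or `linear_kernel` on the tf-idf matrix.

```python
def get_recommendations(title, titles, sim_matrix, top_n=3):
    """Return the top_n titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)
    # pair every movie index with its similarity to the query movie
    scores = sorted(enumerate(sim_matrix[idx]), key=lambda pair: pair[1], reverse=True)
    # skip the query movie, whose similarity with itself is 1
    return [titles[i] for i, _ in scores if i != idx][:top_n]


# Hypothetical titles and pairwise similarity scores
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.8, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]
recs = get_recommendations("A", titles, sim, top_n=2)
print(recs)
```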

## Beyond n-grams, Word Embeddings :

- synonyms : with BoW / tf-idf, happy, joyous and sad are all treated as unrelated words, so happy-joyous gets the same score as happy-sad.
- word embeddings : mapping words to an n-dimensional vector space, produced using deep learning and huge amounts of data. Similar words get similar vectors
- complex relationships (king/queen, man/woman)
- dependent on the pretrained model

- Spacy :
      nlp = spacy.load('en_core_web_lg')
- generate word vector for each token :
      for token in doc: print(token.vector)
- Word similarity score :
      doc = nlp('happy sad joyous')
      for token1 in doc:
          for token2 in doc:
              print(token1.text, token2.text, token1.similarity(token2))
- Document similarity :
      doc1.similarity(doc2)
  - works between sentences as well













# Lecture 3 : Feature Engineering

## Count based models :
- Vectorization : convert the words into vectors
- Combine the vectors to create a vector space
- the unique words of the corpus define the dimensions of the vector space
- BOW : loses context, semantics
- tf-idf, n-grams


## Word vectors :
- Idea of distributed representation / dependence on other words (linear algebra)
- representing words as continuous multidimensional vectors of floating point numbers
- semantically similar words will be mapped nearer to each other (Relations : singular, plural, gender, other relations)
- Word2Vec Model (Google) : the context for each word is in its nearby words (distributional hypothesis)
- **Word embedding :** the transformation of words to vectors
- **Continuous bag of words (CBOW) + Skip-gram** : the two Word2Vec architectures
- CBOW : learns embeddings by predicting the current word based on the context (faster, more accurate for frequent words)
- Skip-gram : learns embeddings by predicting the surrounding words based on the current word (works well with small amounts of data, better with rare words / phrases)

### CBOW :
[Link](https://iu.instructure.com/courses/2165940/files/160988843/download?wrap=1)
- the current (target) word is predicted from the words in its context window, and the embedding weights are learned in the process
- Word2Vec is unsupervised - no label
- Keras :
      from keras.preprocessing import text
      from keras.utils import np_utils
      from keras.preprocessing import sequence
- Preprocess :
      from nltk.tokenize import sent_tokenize
      from nltk.tokenize import RegexpTokenizer
      from nltk.corpus import stopwords
      from nltk.corpus import gutenberg
      alice = gutenberg.raw(fileids='carroll-alice.txt')

- function :
      def norm(text):
          norm_text = []
          tokenizer = RegexpTokenizer('[a-zA-Z]+')
          # split into sentences, then into alphabetic tokens
          tokens_sentences = [tokenizer.tokenize(t) for t in sent_tokenize(text)]
          stop_words = set(stopwords.words('english'))
          for s in tokens_sentences:
              # keep the lowercased, non-stopword tokens of this sentence
              w_norm = [w.lower() for w in s if w.lower() not in stop_words]
              norm_text.append(' '.join(w_norm))
          return norm_text

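- Create a vocab :
  tokenizer :
      tokenizer = text.Tokenizer()
      tokenizer.fit_on_texts(corpus)
      word2id = tokenizer.word_index
      # reverse lookup and integer-encoded sentences, both used in the later steps
      id2word = {v: k for k, v in word2id.items()}
      wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in corpus]
    - lower integer : more frequent word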
  
  - generate (context window, target word) pairs
  - **yield** : makes the function a generator, so pairs are produced one at a time
  - code :

      def generate_context_word_pairs(corpus, window_size, vocab_size):
          context_length = window_size*2
          for words in corpus:
              sentence_length = len(words)
              for index, word in enumerate(words):
                  context_words = []
                  label_word = []
                  start = index - window_size
                  end = index + window_size + 1
                  # ids inside the window, excluding the target word itself
                  context_words.append([words[i]
                                        for i in range(start, end)
                                        if 0 <= i < sentence_length and i != index])
                  label_word.append(word)
                  x = sequence.pad_sequences(context_words, maxlen=context_length)
                  y = np_utils.to_categorical(label_word, vocab_size)
                  yield (x, y)

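  - Create a sequential model :
        import keras.backend as K
        from keras.models import Sequential
        from keras.layers import Dense, Embedding, Lambda

        vocab_size = len(word2id) + 1   # +1 : index 0 is reserved for padding
        embed_size = 100
        window_size = 2

        cbow = Sequential()
        cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size,
                           input_length=window_size*2))
        # average the embeddings of the context words
        cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
        cbow.add(Dense(vocab_size, activation="softmax"))
        # the optimizer is missing in the notes; 'rmsprop' is an assumption
        cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')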

- Train :
      for epoch in range(1, 3):  # 2 epochs for demo. Use more epochs
          loss = 0.
          i = 0
          for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
              i += 1
              loss += cbow.train_on_batch(x, y)
              if i % 100000 == 0:
                  print('Processed {} (context, word) pairs'.format(i))
          print('Epoch:', epoch, '\tLoss:', loss)
          print()

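- Get Word Embeddings :
      weights = cbow.get_weights()[0]
      weights = weights[1:]   # drop row 0 (padding index), so row k-1 belongs to word id k
      from sklearn.metrics.pairwise import euclidean_distances
      distance_matrix = euclidean_distances(weights)

      # 5 nearest words for each search term (position 0 is the term itself)
      similar_words = {search_term: [id2word[idx] for idx in
                                     distance_matrix[word2id[search_term]-1].argsort()[1:6]+1]
                       for search_term in ['alice', 'queen', 'rabbit']}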

- Visualize :
      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.manifold import TSNE

      words = sum([[k] + v for k, v in similar_words.items()], [])
      words_ids = [word2id[w] for w in words]
      # row idx-1 : the padding row was dropped from weights
      word_vectors = np.array([weights[idx - 1] for idx in words_ids])
      print('Total words:', len(words), '\tWord Embedding shapes:', word_vectors.shape)

      # n_iter is named max_iter in recent scikit-learn releases
      tsne = TSNE(n_components=2, random_state=0, n_iter=10000, perplexity=3)
      np.set_printoptions(suppress=True)
      T = tsne.fit_transform(word_vectors)
      labels = words
      plt.figure(figsize=(10, 6))
      plt.scatter(T[:, 0], T[:, 1], c="steelblue", edgecolors="k", s=40)

      for label, x, y in zip(labels, T[:, 0], T[:, 1]):
          plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points', fontsize=16)
      plt.show()
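
The (context, target) pairs that CBOW trains on can be illustrated without Keras. This is a simplified sketch that works on words instead of padded integer ids; the helper name and the sample sentence are assumptions.

```python
def context_target_pairs(tokens, window_size):
    """For each position, collect up to window_size neighbours on each side."""
    pairs = []
    for index, target in enumerate(tokens):
        start = max(0, index - window_size)
        end = min(len(tokens), index + window_size + 1)
        context = [tokens[i] for i in range(start, end) if i != index]
        pairs.append((context, target))
    return pairs


pairs = context_target_pairs(["alice", "was", "very", "tired"], window_size=1)
for context, target in pairs:
    print(context, "->", target)
```

CBOW predicts the target from the context side of each pair; skip-gram flips the direction and predicts each context word from the target.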


## Vector based models :
[Link](https://iu.instructure.com/courses/2165940/files/160988885/download?wrap=1)

- BOW Count based : TF, IDF, N-grams
- Prediction based : distributed representations (word embedding vectors), Word2Vec, FastText, dense representation in multiple dimensions
- Word2Vec : has a small window, unable to learn from global frequency, smallest unit : word, negative sampling : each training sample only updates a % of the model's weights
- GloVe : smallest unit : word, frequent co-occurrences carry additional info
(in a way a count based model), no window
- Both face issues with unknown words, remedy : treat all unknown words as a single out-of-vocabulary token
- FastText : smallest unit : character n-grams (wh, whe, ere), each word is a bag of character n-grams
- Sentence embedding : Doc2Vec, skip-thought models :
  - Unsupervised : Skip-thoughts, quick-thoughts
  - supervised : InferSent
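
To illustrate the "bag of character n-grams" idea behind FastText, here is a small sketch. The `<` and `>` boundary markers and the 3 to 6 length range follow the usual description of FastText and are assumptions here, not something stated in these notes.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)
```

Because an unknown word still shares many of its n-grams with known words, a vector can be assembled for it from those pieces, which is how this approach sidesteps the out-of-vocabulary issue.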


# Week 4: Similarity

##  Distance Metrics Summary :
1. Euclidean distance - the diagonal line - the shortest path between two points, also known as L2 Norm and Pythagorean metric.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html



2. Manhattan distance - the sum of the absolute differences between the coordinates of two points (|x1 - x2| + |y1 - y2| in 2D). Also known as L1 distance, L1 norm, or city block distance.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cityblock.html



3. Minkowski distance is a generalization of Euclidean and Manhattan distance, and it defines the distance between two points in a normed vector space. When p = 1, the distance is Manhattan; when p = 2, the distance is Euclidean.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.minkowski.html
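
A short sketch showing how a single Minkowski function covers both of the earlier metrics. The helper is written by hand for illustration (SciPy's `scipy.spatial.distance.minkowski`, linked above, is the library version); the two points are arbitrary.

```python
def minkowski(u, v, p):
    """(sum of |u_i - v_i|^p) ^ (1/p)"""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


u, v = (0, 0), (3, 4)
manhattan = minkowski(u, v, p=1)   # |3| + |4|
euclidean = minkowski(u, v, p=2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```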


4. Jaccard Similarity Coefficient defines similarity between finite sets as the quotient (a result obtained by dividing one quantity by another) of their intersection [dividend] and their union [divisor].

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html


- Jaccard index / Similarity Coefficient is between 0 and 1. Closer to 1 - more similar!
  
  - Jaccard distance measures dissimilarity between 2 sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1.
  - Closer to 1 - more distant!
  - jaccard distance formula: 1 minus Jaccard index
  - Lemmatization should be done before computing the Jaccard index (reducing to the root words, so that variants of a word count as the same element)

        Defined as intersection over union:
        # Jaccard index (similarity) + jaccard distance (dissimilarity) = 1



5.  Cosine distance is based on the angle between two documents/vectors. Cosine similarity divides the dot product of the two vectors (dividend) by the product of their Euclidean norms (divisor); cosine distance = 1 - cosine similarity.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html

  - A cosine similarity close to 1 indicates very high similarity between the two vectors/documents
  
  - Document is converted to a vector (Rn) where n is the number of unique words and each element has a value associated with a word (e.g. TF, TFIDF, CBOW). Note - common frequent words will influence the similarity score

          # Have a corpus with 2 sentences :
          import pandas as pd
          from sklearn.feature_extraction.text import CountVectorizer
          from sklearn.metrics.pairwise import cosine_similarity

          corpus = [sen_1, sen_2]

          vectorizer = CountVectorizer(stop_words='english')
          t = vectorizer.fit_transform(corpus)
          model = t.toarray()

          # cosine similarity :
          sim = cosine_similarity(model)

          # convert the cosine similarity into a df:
          d = pd.DataFrame(sim, index=['Doc1','Doc2'])
          d.columns = ['Doc1','Doc2']

- Cosine vs Jaccard similarity:

  - Jaccard takes only the unique set of words into consideration, cosine uses the full term vectors (including counts)
  - **For Jaccard, word repetitions wouldn't make a difference; for Cosine, the vector would change if a word was repeated**
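
The difference under word repetition can be checked with a small experiment. The helpers below are hand-written illustrations (set-based Jaccard, count-based cosine) and the sentences are made up.

```python
import math
from collections import Counter


def jaccard_similarity(tokens_a, tokens_b):
    """Intersection over union of the unique words."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_from_counts(tokens_a, tokens_b):
    """Cosine similarity of the raw term-count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b)


doc1 = "data science is fun".split()
doc2 = "data science".split()
doc2_repeated = "data data data science".split()   # same words, 'data' repeated

print(jaccard_similarity(doc1, doc2), jaccard_similarity(doc1, doc2_repeated))
print(cosine_from_counts(doc1, doc2), cosine_from_counts(doc1, doc2_repeated))
```

The Jaccard score is identical for both versions of the second document, while the cosine score changes because the repeated word shifts the direction of the count vector.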


## Document clustering using unsupervised learning:

- an unsupervised learning technique to group data points (documents) into groups or clusters
- 2 types:
  - Partitional : division into non-overlapping groups
  - Hierarchical : nested clusters are formed as a hierarchical tree, further divided into:
    - Agglomerative Clustering (Bottom up) : every document starts as a separate node at the bottom, closest clusters are merged
    - Divisive : Top down -> the most heterogeneous clusters are divided into 2

- Agglomerative clustering:
  - Measuring dissimilarity between the clusters:
    1. Max / complete linkage: computes all pairwise dissimilarities between the documents in cluster 1 and cluster 2, then chooses the largest value, tends to produce more compact clusters
    2. Min / Single linkage: pairwise, considers the smallest value, tends to produce loose clusters
    3. Mean / Average linkage: pairwise, considers the average of the dissimilarities
    4. Ward Linkage: Minimizes total within cluster variance


- Pairwise document similarity: if we have C documents in a corpus, we will end up with a C by C matrix

      # Plotting the dendrogram:

      import pandas as pd
      import matplotlib.pyplot as plt
      from sklearn.metrics.pairwise import cosine_similarity
      from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

      # using the tf-idf scores from the normalised corpus:
      similarity_matrix = cosine_similarity(tv_matrix)

      Z = linkage(similarity_matrix, 'ward')

      pd.DataFrame(Z, columns=['Document\\Cluster1', 'Document\\Cluster2', 'Distance', 'Cluster Size'], dtype='object')

      # plotting

      plt.figure(figsize=(8,3))
      plt.title('')
      plt.xlabel('Data point')
      plt.ylabel('Distance')

      dendrogram(Z)

      plt.axhline(y=1.0, c='k', ls='--', lw=0.5)
      plt.show()

      # by adding labels to the documents

      corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
      corpus_df = corpus_df[['Document','Category']]

      # cut the tree at max_dist to get flat cluster labels
      max_dist = 1.0
      cluster_labels = fcluster(Z, max_dist, criterion='distance')

      cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
      pd.concat([corpus_df, cluster_labels], axis=1)
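
The max, min and average linkage criteria listed earlier differ only in how they summarize the pairwise dissimilarities between two clusters. A minimal sketch, with one-dimensional points and a hypothetical helper name (Ward linkage is left out because it is variance-based rather than pairwise):

```python
def cluster_distance(cluster_a, cluster_b, dist, method="single"):
    """Summarize all pairwise dissimilarities between two clusters."""
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":      # min linkage
        return min(pairwise)
    if method == "complete":    # max linkage
        return max(pairwise)
    if method == "average":     # mean linkage
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")


def abs_diff(a, b):
    return abs(a - b)


# Two toy clusters of points on a line
cluster_a = [0.0, 1.0]
cluster_b = [4.0, 6.0]

single = cluster_distance(cluster_a, cluster_b, abs_diff, "single")
complete = cluster_distance(cluster_a, cluster_b, abs_diff, "complete")
average = cluster_distance(cluster_a, cluster_b, abs_diff, "average")
print(single, complete, average)
```

At every step, agglomerative clustering merges the two clusters with the smallest such distance, so the choice of criterion decides which clusters get merged first.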

## Topic Modeling:

- Extracting key themes from a corpus of documents
- Each topic is represented as a collection of words
- **Latent Dirichlet Allocation (LDA):**
  - LDA is a probabilistic model used for topic modeling, particularly in the field of natural language processing (NLP).
  - Its primary goal is to discover topics within a collection of documents and assign documents to one or more of these topics.
  - LDA assumes that each document is a mixture of various topics, and it aims to estimate the topic proportions for each document and the word distribution for each topic.
  - LDA is an unsupervised technique and does not involve any classification or discrimination tasks. Its purpose is to reveal the underlying thematic structure of a document corpus.

- Using Scikit learn to generate the topics (so we dont have to specify them manually)

- Document-term matrix is decomposed into :
  - Document-topic matrix (feature matrix)
  - Topic-term matrix (potential topics in the corpus)

        import pandas as pd
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        cv = CountVectorizer(min_df=0., max_df=1.)
        cv_matrix = cv.fit_transform(norm_corpus)
        vocab = cv.get_feature_names_out()   # only available after fitting

        # n_components = number of topics; with fewer topics than terms, dimensionality reduction happens
        lda = LatentDirichletAllocation(n_components=3, max_iter=10000, random_state=0)

        dt_matrix = lda.fit_transform(cv_matrix)

        features = pd.DataFrame(dt_matrix, columns=['T1', 'T2', 'T3'])

        # print the highest-weighted terms of each topic
        tt_matrix = lda.components_
        for topic_weights in tt_matrix:
            topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
            topic = sorted(topic, key=lambda x: -x[1])
            topic = [item for item in topic if item[1] > 0.6]
            print(topic)
            print()



# NLP With SpaCY

## Basics
- Insights from unstructured data
- statistics, machine learning and deep learning
- NER, sentiment analysis, text generation
- spaCy, information extraction (spacy.io)
- calling the nlp object on a text returns a Doc object that stores the processed text
- tokenization :

      import spacy
      nlp = spacy.load('en_core_web_sm')
      text = ''
      doc = nlp(text)
      print([token.text for token in doc])

- multiple data structures to represent text:
  - Doc (linguistic annotations of text)
  - Span (a slice from a doc object)
  - Token (an individual token)

- Pipeline components :
  - Tokenizer, tagger, lemmatizer, entity recognizer
  - sentence segmentation : uses DependencyParser
        for sent in doc.sents:
            print(sent.text)
  - Lemmatisation :
        print([(token.text, token.lemma_) for token in doc])

        document = nlp(text)
        tokens = [token.text for token in document]
        
        # Append the lemma for all tokens in the document
        lemmas = [token.lemma_ for token in document]
        print("Lemmas:\n", lemmas, "\n")
        
        # Print tokens and compare with lemmas list
        print("Tokens:\n", [token.text for token in document])

        # Generating a documents list of all Doc containers
        documents = [nlp(text) for text in texts]
        
        # Iterate through documents and append sentences in each doc to the sentences list
        sentences = []
        for doc in documents:
            sentences.append([s for s in doc.sents])
        
        # Find number of sentences per each doc container
        print([len(s) for s in sentences])

        # the same list as a nested comprehension
        sentences = [[sent for sent in doc.sents] for doc in documents]
  
## Linguistic features in SpaCy

- POS tagging(part of speech) : verb, noun, adj, adv, conj
  - to confirm the meaning of the words
  - spacy.explain()
        print(
          [(token.text, token.pos_, spacy.explain(token.pos_))
          for token in nlp(sent)]
          )
- Named Entity recog
  - Person, org, gpe (geo-political entity), loc (eg. mountain ranges), date, time
  - doc.ents (a property, not a method), label : .label_
        print(
          [(ent.text, ent.start_char, ent.end_char, ent.label_)
          for ent in doc.ents]
          )
  - Alternate approach :
        print (
          [(token.text, token.ent_type_)
          for token in doc]
          )
  - DisplaCy : visualizer
        import spacy
        from spacy import displacy
        text = ''
        nlp = spacy.load('en_core_web_sm')
        doc = nlp(text)

        displacy.serve(doc, style='ent')


- Importance of POS :
  - word sense disambiguation
  - dependency parsing :
    - explores a sentence's syntax, links between two tokens, results in a tree
    - dependency label describes the type of syntactic relation between 2 tokens
    (nominal subject, root, determiner, direct object, auxiliary)
  - Displaying a dependency tree:
        doc = nlp(text)
        spacy.displacy.serve(doc, style='dep')
      - parent -> dependent
            [(token.text, token.dep_, spacy.explain(token.dep_)) for token in doc]




## Introduction to word vectors :

- word vectors / word embeddings are numerical representations of text data
- Count-based representations don't help the computer understand the meaning of a sentence : sentences with the same meaning but different words get different vectors, hence the model is oblivious to context and semantics
- Word vectors have a pre-defined number of dimensions that capture aspects of meaning (eg: living being, feline, human, gender, royalty, verb, plural, etc)
- Multiple approaches : Word2Vec, GloVe, fastText, transformer based approaches
- spacy models with word vectors :
  - en_core_web_md / lg : how to check
        nlp.meta["vectors"]
  - we can only use the word vectors that exist in the model's vocab
        nlp.vocab
        like_id = nlp.vocab.strings['like']
        nlp.vocab.vectors[like_id]
        (can be used to access the word vector using its id)

## Visualizing word vectors and similar context :
- Using a scatter plot to visualize how the word vectors are grouped
- extracting the principal components using PCA
- using matplotlib, spacy, scikitlearn
      import matplotlib.pyplot as plt
      from sklearn.decomposition import PCA
      import numpy as np

      # extract the word vectors for a given list of words and stack them vertically

      words = ['','']
      word_vectors = np.vstack(
        [nlp.vocab.vectors[nlp.vocab.strings[w]] for w in words]
      )

      # extract 2 principal components and project them into 2d space

      pca = PCA(n_components = 2)
      word_vectors_transformed = pca.fit_transform(word_vectors)

      # visualising the scatter plot
      plt.figure(figsize = (10,8))
      plt.scatter(word_vectors_transformed[:, 0],
                  word_vectors_transformed[:, 1])

      # adding words to the plot :

      for word, coord in zip(words, word_vectors_transformed):
        x,y = coord
        plt.text(x,y,word, size=10)
      plt.show()

- Analogies and vector operations :
  - a word analogy is a semantic relationship between a pair of words
  - word embeddings capture analogies such as gender and tense, queen - woman + man = king

  - finding similar words (semantically) in the vocab :
        # nearest neighbours of a word vector within the model's vocab
        import numpy as np
        import spacy
        nlp = spacy.load('en_core_web_md')

        word = 'covid'

        most_similar_words = nlp.vocab.vectors.most_similar(
          np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=5
        )

        words = [nlp.vocab.strings[w] for w in most_similar_words[0][0]]

        print(words)
    
  - Semantic similarity helps us categorize text into predefined categories or to detect relevant text / flag duplicate content
    - finding similarity scores : cosine similarity
          token1 = doc[2]
          token2 = doc[3]
          similarity = round(token1.similarity(token2),3)
    - similarly spacy can calculate for 'span' object (span of a doc object)
          span1, span2 = doc1[1:], doc2[1:]
          span1.similarity(span2)
    - doc similarity : same, doc1.similarity(doc2)
  - getting contextually similar sentences
         sentences = nlp(sentences)
         keyword = nlp(word)

         for i, sentence in enumerate(sentences.sents):
             print(f"Similarity score with sentence {i+1}:",
                   round(sentence.similarity(keyword), 5))

## Spacy Pipelines

- creating spacy pipelines / adding them to an existing pipeline:
  - Adding pipes : sentence segmentation for a document with 10k sentences is long and time consuming with a full pipeline (use the **sentencizer**)
  - when using an existing model on the text, the whole nlp pipeline gets activated
        nlp = spacy.load('en_core_web_sm')
        doc = nlp(text)
  - A better method is to create a blank spacy model so that only the sentence segmentation part of the pipeline runs
        # create a blank model and add a 'sentencizer' pipe
        blank_nlp = spacy.blank('en')
        blank_nlp.add_pipe('sentencizer')
        doc = blank_nlp(text)
  - Analyzing pipeline components to check whether any attributes are not set
    - nlp.analyze_pipes():
      - attributes they set on the doc and token
      - scores produced during training
      - shows warnings if the component requirements are not met
            nlp = spacy.load('en_core_web_sm')
            analysis = nlp.analyze_pipes(pretty=True)

### EntityRuler

- lets us add entities to doc.ents, or can be used on its own
- using dictionaries for patterns
  - Phrase entity patterns (for exact string matches): {'label': 'ORG', 'pattern': 'Microsoft'}
  - Token entity patterns (a list with one dictionary describing each token) : {'label': 'GPE', 'pattern': [{'LOWER': 'san'}, {'LOWER': 'francisco'}]}
        nlp = spacy.blank('en')
        entity_ruler = nlp.add_pipe('entity_ruler')
        patterns = [{'label': 'ORG', 'pattern': 'Microsoft'},
                    {'label': 'GPE', 'pattern': [{'LOWER': 'san'}, {'LOWER': 'francisco'}]}]
        entity_ruler.add_patterns(patterns)

        doc = nlp(text)
        print([(ent.text, ent.label_) for ent in doc.ents])

        # with a pretrained pipeline : before='ner' runs the ruler ahead of the NER component, so its matches take priority;
        # placed after 'ner', it only adds entities that do not overlap with the model's predictions
        ruler = nlp.add_pipe('entity_ruler', before='ner')


### RegEx / matcher with Spacy
- for instance :
      import re
      pattern = r"((\d){3}-(\d){3}-(\d){4})"

      iter_matches = re.finditer(pattern, text)

      for match in iter_matches:
          start_char = match.start()
          end_char = match.end()
          print(text[start_char:end_char])

- with spacy
  - patterns =
        [{'label': 'Phone_number', 'pattern': [{'SHAPE': 'ddd'}, {'ORTH': '-'}, {'SHAPE': 'ddd'}, {'ORTH': '-'}, {'SHAPE': 'dddd'}]}]

- Alternative to regex : Matcher class

      import spacy
      from spacy.matcher import Matcher
      nlp = spacy.load('en_core_web_sm')
      doc = nlp(text)
      
      matcher = Matcher(nlp.vocab)

      pattern = [{'LOWER': 'good'}, {'LOWER': {'IN': ['evening', 'morning']}}]

      matcher.add('morning_greeting',[pattern])
      matches = matcher(doc)

      for match_id, start, end in matches :
        print('start_token:', start, "|End token:", end, "|Matched Text",
        doc[start:end])

  - Allows the patterns to be more expressive (IN / NOT_IN)
  - PhraseMatcher helps to match long phrases

        from spacy.matcher import PhraseMatcher
        nlp = spacy.load('en_core_web_sm')
        matcher = PhraseMatcher(nlp.vocab)

        # can add attr='LOWER' (case-insensitive matching) or attr='SHAPE' (eg. patterns like 101.12.2)

        terms = ['Bill Gates']

        patterns = [nlp.make_doc(term) for term in terms]
        matcher.add('PeopleOfInterest',patterns)

## Customizing
- Entities not seen during training (# in twitter, medical data)
  - can't be classified accurately using spacy's NER models
  - Training spacy models : common entities could be different from the specialised ones

- Training steps :
  1. Annotate and prepare input data
  2. Initialize the model weights
  3. predict using current weights
  4. compare prediction with correct answers
  5. use optimizer to calculate weights that improve model performance
  6. update weights
  7. repeat from step 3

- Annotated data : has to be stored as a dictionary
      {
        'sentence' : text,
        'entities' : {
          'label' : 'Medicine',
          'value' : 'neuraminidase inhibitors'
        }
      }
- Training data :
      [
        ('I will visit you in Austin', {'entities':[(20,26,'GPE')]})
      ]

- We can't put raw data directly into the model, we need to create an Example object

      import spacy
      from spacy.training import Example
      nlp = spacy.load('en_core_web_sm')

      # the text must match the character offsets in the annotation
      doc = nlp('I will visit you in Austin')
      annotations = {'entities': [(20,26,'GPE')]}

      example_sentence = Example.from_dict(doc,annotations)
      print(example_sentence.to_dict())



      # for instance
      text = "A patient with chest pain had hyperthyroidism."
      entity_1 = "chest pain"
      entity_2 = "hyperthyroidism"
      
      # Store annotated data information in the correct format
      annotated_data = {"sentence": text, "entities": [{"label": 'SYMPTOM', "value": 'chest pain'}, {"label": 'DISEASE', "value": 'hyperthyroidism'}]}
      
      # Extract start and end characters of each entity
      entity_1_start_char = text.find(entity_1)
      entity_1_end_char = entity_1_start_char + len(entity_1)
      entity_2_start_char = text.find(entity_2)
      entity_2_end_char = entity_2_start_char + len(entity_2)
      
      # Store the same input information in the proper format for training
      training_data = [(text, {"entities": [(entity_1_start_char,entity_1_end_char,"SYMPTOM"),
                                      (entity_2_start_char,entity_2_end_char,"DISEASE")]})]
      print(training_data)

      # Converting the training data into Example objects

      example_text = 'A patient with chest pain had hyperthyroidism.'
      training_data = [(example_text, {'entities': [(15, 25, 'SYMPTOM'), (30, 45, 'DISEASE')]})]
      
      all_examples = []
      
      # Iterate through text and annotations and convert text to a Doc container
      for text, entities in training_data:
          doc = nlp(text)
      
          # Create an Example object from the doc container and annotations
          example_sentence = Example.from_dict(doc, entities)
          print(example_sentence.to_dict(), "\n")
      
          # Append the Example object to the list of all examples
          all_examples.append(example_sentence)
      
      print("Number of formatted training data: ", len(all_examples))

  
- Training spacy model for NER task
  1. Annotate and prepare input data
  2. Disable other pipeline components
  3. Train a model for a few epochs
  4. Evaluate model performance


- Disabling all other pipeline components:
      other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
      nlp.disable_pipes(*other_pipes)

- Creating an 'Optimizer' object to update the model weights:
      import random

      optimizer = nlp.create_optimizer()

      epochs = 2   # number of passes over the training data (example value)
      losses = {}

      for i in range(epochs):
        # shuffle so the model does not learn the order of the examples
        random.shuffle(training_data)

        for text, annotation in training_data:
          doc = nlp.make_doc(text)
          example = Example.from_dict(doc, annotation)
          nlp.update([example], sgd=optimizer, losses=losses)

- Saving a trained model :

      ner = nlp.get_pipe('ner')
      ner.to_disk('<ner model name>')

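- Load the saved model :
      # spaCy v3: add_pipe takes the factory name and returns the new component
      # (the pipeline must not already contain a 'ner' component, e.g. a blank pipeline)
      nlp = spacy.blank('en')
      ner = nlp.add_pipe('ner')
      ner.from_disk('<ner model name>')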


# Week 5 : Text classification

- Text classification into given set of labels
- Supervised,using training data
- News, sentiment analysis (opinion mining), email filtering
- Types on the basis of content:
  - content based : priority is given to topic weights
  - request based : user behaviour

- Automations:
  - Supervised (Classification and Regression) : pre-labeled data
  - Unsupervised : Based on pattern mining and finding latent structures in data

- Classification techniques (characteristic of text data : low-frequency, high-dimensional features):
  - Decision Trees
  - Pattern (Rule based)
  - SVM : optimal boundaries
  - Bayesian (Generative) Classifiers : probabilistic classifier based on modeling the underlying word features in different classes

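### Multinomial Naive Bayes:
- Assumption: Given the class, the probabilities of occurrence of the different terms are independent of one another
  - P(A|B) = P(B|A) x P(A) / P(B)
  - Posterior = prior x likelihood / evidence
  - Laplace correction : a smoothing technique that avoids zero probabilities for terms never seen with a class (the count matrix is sparse). A small-sample correction: a pseudo-count alpha is added to every count
    - theta(i) = (x(i) + alpha) / (N + alpha x d), where x(i) is the count of term i in the class, N the total term count in the class and d the vocabulary size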
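
A minimal sketch of the Laplace correction, assuming the formula reads theta(i) = (x(i) + alpha) / (N + alpha x d). The function name and the toy counts are made up for illustration; the point is that a term with a zero count still gets a small non-zero probability.

```python
import numpy as np

def smoothed_term_probabilities(term_counts, alpha=1.0):
    """Return theta_i = (x_i + alpha) / (N + alpha * d) for one class."""
    counts = np.asarray(term_counts, dtype=float)
    n_total = counts.sum()      # N: total term count in the class
    vocab_size = counts.size    # d: number of distinct terms
    return (counts + alpha) / (n_total + alpha * vocab_size)

# Toy counts for a 4-term vocabulary in one class; the third term was never seen
counts = [3, 1, 0, 4]
theta = smoothed_term_probabilities(counts, alpha=1.0)
print(theta)   # the unseen term gets 1/12 instead of 0
```

With alpha = 0 the unseen term would get probability 0 and wipe out the whole product of likelihoods for any document containing it.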

### Evaluation:

- Cross Validation : Model validation technique
  - Helps us evaluate the quality of the model
  - To select the model that will perform best on unseen data
  - To avoid overfitting and underfitting the data
  - train / test split or K-fold validation (k=5, generally)
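
A small sketch of 5-fold cross-validation for a Multinomial Naive Bayes text classifier with scikit-learn. The ten toy reviews and their labels are invented for illustration, so the scores say nothing about real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (made up): 5 positive and 5 negative reviews
texts = [
    "great movie loved it",
    "wonderful acting and great story",
    "loved the wonderful story",
    "great fun loved every minute",
    "a wonderful great film",
    "terrible movie hated it",
    "awful acting and boring story",
    "hated the boring story",
    "terrible and awful hated every minute",
    "a boring awful film",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# The vectorizer sits inside the pipeline, so it is re-fitted on each training fold
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))

# cv=5: train on 4 folds, score on the held-out fold, repeat 5 times
scores = cross_val_score(model, texts, labels, cv=5)
print(scores, scores.mean())
```

Keeping the vectorizer inside the pipeline matters: fitting it on all the data first would leak vocabulary from the held-out fold into training.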



# Large language models:

- Building Blocks:
  - Text pre-processing : raw text data into a standard format, Tokenization -> Stop Word Removal -> Lemmatization

  - Text representation :
    - BOW (matrix of word counts)
    - Word embeddings (semantic representation -> word weights -> relationship modeling)
  - Pre-Training
  - Fine-Tuning:
    - Transfer learning : (N-shot)
      - Zero-shot : no task specific data
      - Few-shot : little task specific data
      - Multi-shot : relatively more training data


## Pretraining to build LLMs:

- Generative Pre-Training:
  - Input data of text tokens -> trained to predict the tokens within the dataset
    - Next Word Prediction :
      - Supervised-style learning on input / output pairs taken from the text itself
      - predicts the next word, generates coherent text
    - Masked Language modeling:
      - Hides a selected word
      - trained model predicts the masked word
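
A rough sketch of how the two pre-training objectives turn plain text into training pairs. The helper names and the `[MASK]` placeholder are assumptions for illustration; real pipelines work on subword token ids, not whole words.

```python
def next_word_pairs(tokens):
    """(context, target) pairs: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_token(tokens, position, mask="[MASK]"):
    """Hide one token; the model must recover it from the context on both sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask
    return masked, target

tokens = "the cat sat on the mat".split()

# Next word prediction: the context grows one token at a time
pairs = next_word_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# Masked language modeling: one token is hidden inside the sentence
masked, target = mask_token(tokens, 2)
print(masked, "->", target)
```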

## Transformer:

- Part of pre-training
- Introduced in the paper 'Attention Is All You Need'

- Architecture:
  - Models long-range relationships between words to generate coherent text
  - Components:
    - Pre-processing
    - Positional encoding : encodes each word's position, so word order and the distance between words are available to the model
    - encoders : attention mechanism and neural network
    - decoders
  - Challenge: Long-range dependency
  - attention: focus on different parts of the input

  - Transformers process multiple parts of the input simultaneously
  - Faster processing

### Attention mechanisms:

- Understand complex structures
- focus on important words
  - Self attention:
    - weighs importance of each word
    - captures long range dependencies
    - in a group conversation, evaluating each person's words and comparing their relevance
    - combine for a more comprehensive understanding for the conversation
  

  - Multi-head
    - Splits the input into multiple heads focusing on different aspects of the relationships between words
    - for instance: different aspects of the conversation, speaker's emotion, primary topic, related side-topic

    - The boy went to the store to buy some groceries and he found a discount on his favourite cereal
      - attention: boy, store, groceries, discount
      - self-attention: boy, he - same person
      - multi-head : character, action, things involved
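
A minimal NumPy sketch of single-head scaled dot-product self-attention, to make "weighs importance of each word" concrete. The random embeddings and projection matrices are stand-ins for learned values, and the sizes are arbitrary; multi-head attention would run several such heads in parallel and concatenate their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Each token attends to every token in the same sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # pairwise relevance, scaled
    weights = softmax(scores, axis=-1)           # one weight distribution per token
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4              # arbitrary toy sizes
X = rng.normal(size=(n_tokens, d_model))         # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(2))   # row i: how much token i attends to each token
```

Because every row of `weights` covers the whole sequence, a token can draw on a distant token (such as "he" on "boy") in a single step.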

### Advanced Fine tuning:

- Reinforcement Learning from Human Feedback (RLHF):
  - General purpose data lacks quality : noise, errors, reduced accuracy
  - Model outputs are reviewed by humans, and their feedback is used to improve the model

### Data concerns / considerations

1. Data volume
2. Data quality (Labeled data)
3. Data Bias: Societal stereotypes, evaluate bias mitigation techniques
4. Privacy

## check LLM
---


# Week 6 :

## 1. Topic Modeling

- Intro to topic modeling:
  - BOW : numeric frequencies of the words
  - Document term Matrix :
    - Frequency of terms in a collection of documents
    - each row represents a document
    - Sparse, Cannot capture Latent variables (semantics)

- Topic Modeling : The process of learning, recognizing, and extracting hidden topics across a collection of documents
  - Idea: Each document consists of a mixture of topics, each topic consists of a collection of words
  - Types:
    1. LSI(LSA): Latent semantic indexing
    2. pLSA: probabilistic latent semantic analysis
    3. LDA: Latent Dirichlet allocation
    4. lda2vec: LDA + Word2vec

- **Latent Semantic Analysis**:
  - oldest / simplest
  - reducing the word space dimensionality
  - Representation:
    - Term Document Matrix (TDM / DTM)
    - The matrix is decomposed using SVD (singular value decomposition)
      - Find the best approximation of the data points using fewer dimensions
      - Identify and order the dimensions along which datapoints exhibit the most variation
      - [A] = [U][S][V]^T, with A as a document-term matrix (rows = documents):
        - U holds the left singular vectors (document-topic relationship)
        - S holds the singular values on its diagonal (the strength of each topic in the corpus)
        - V holds the right singular vectors (term-topic relationship)
        - for a term-document matrix (rows = terms) the roles of U and V swap
      - **Truncated SVD** : Reduces dimensionality by selecting only:
        - K largest S values
        - only K columns of U and V (K is the hyper-parameter that we can adjust to select the number of topics we want to find)
        - see the LDA code
  - Strength : Noise removal, dimensionality reduction, captures semantic relation
  - Weakness : Interpretability (topics are word vectors with both, positive and negative direction), Evaluation
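
A small NumPy sketch of LSA through truncated SVD, assuming a document-term matrix with documents as rows. The 4 x 5 count matrix is a toy example, and K = 2 is an arbitrary choice of topic count.

```python
import numpy as np

# Toy document-term matrix: 4 documents (rows) x 5 terms (columns)
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Full decomposition A = U S V^T; singular values come back sorted, largest first
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate: keep the K largest singular values and the matching K columns of U and V
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

doc_topics = U_k * s_k        # documents in topic space (4 x K)
term_topics = Vt_k.T          # terms in topic space (5 x K)
A_k = U_k @ np.diag(s_k) @ Vt_k   # best rank-K approximation of A

print(doc_topics.round(2))
print(term_topics.round(2))
```

The entries of `doc_topics` and `term_topics` can be negative, which is the interpretability weakness noted above.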

- **Probabilistic Latent Semantic Analysis**:
  - probabilistic method instead of SVD
  - Models the probability of a word W appearing in a document D as a mixture of conditionally independent multinomial distributions over hidden topics Z (trained via the expectation-maximization algorithm)
  - Core Idea: to find a probabilistic model with latent topics that can generate the data we observe in the TDM matrix
  - Hyperparameter : Number of topics

  - Formula : P(D,W) = P(D) Sigma(Z) P(Z|D) P(W|Z)
  - D->Document, Z->Topic, W->Word
  - Strength: Models can be compared using the probabilities assigned to new documents, documents are represented with positive topic assignments
  - Weakness: Computational complexity, does not yield a generative model for new documents

  - Implementation: Equivalent to Non-Negative Matrix Factorization with beta_loss='kullback-leibler'

- **Latent Dirichlet Allocation (LDA)**:
  - Extends pLSA by adding a generative process
  - Is a generative model, produces hierarchical bayesian model:
    - Assumptions:
      - topics are probability distributions over words
      - documents are probability distributions over topics
      - topics follow a sparse Dirichlet distribution
  - Also has a variant to include metadata (authors, image data, etc.)
  - Process:
    - For each document, a topic mixture is sampled at random from a Dirichlet distribution
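
A sketch of the generative story LDA assumes, written with NumPy. It only simulates how a document would be produced; it does not fit a model. The vocabulary, the number of topics and the Dirichlet parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["goal", "match", "team", "vote", "party", "election"]
n_topics = 2

# Topics: probability distributions over words, drawn from a sparse Dirichlet
topic_word = rng.dirichlet(alpha=[0.1] * len(vocab), size=n_topics)

def generate_document(n_words, doc_alpha=0.5):
    """Sample one document following the LDA generative process."""
    # Document: a probability distribution over topics
    doc_topic = rng.dirichlet(alpha=[doc_alpha] * n_topics)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topic)          # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])    # pick a word from that topic
        words.append(vocab[w])
    return doc_topic, words

doc_topic, words = generate_document(8)
print(doc_topic.round(2), words)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic-word and document-topic distributions that most plausibly generated them.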


## Advanced topic modeling : Neural relational topic models

https://www.youtube.com/watch?v=ykk-FUoDt74







# 3. Sentiment analysis with python

Sentiment analysis : opinion mining

3 elements:
  - opinion (polarity : positive, neutral, negative) / emotion (joy, surprise, disgust)
  - Subject
  - Opinion holder / entity holding the opinion

Used In
  - Social media monitoring
  - Brand Monitoring
  - Customer service
  - Market research / Analysis


1. Types and approaches:
  - Levels of granularity:
    1. Document level (whole review)
    2. Sentence level (opinion in each sentence)
    3. Aspect level (different features of the product)

  - Types of algorithms:
    - Rule / lexicon based (nice +2, good+1,etc.) : Matches the words in the lexicon, then averages or sums the total (total valence)
      - relies on dictionaries
      - different words might have different polarity in different context
      - fast

    - Automatic / Machine learning : using historical data, predicting the sentiment of a new piece of text
      - relies on historical data
      - takes time to train

    - Hybrid is the best (mostly)

              # Calculating the total valence:

              from textblob import TextBlob

              my_valence = TextBlob(text)

              # .sentiment returns a named tuple (polarity in [-1, 1], subjectivity in [0, 1])
              my_valence.sentiment
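
A toy sketch of the rule / lexicon-based approach, to show what summing or averaging the valence means. The four-word lexicon and its scores are made up; real lexicons are far larger and also handle negation and intensifiers.

```python
# Hypothetical mini-lexicon: word -> valence score
LEXICON = {"nice": 2, "good": 1, "bad": -1, "awful": -2}

def total_valence(text, lexicon=LEXICON, average=False):
    """Sum (or average) the valence of every lexicon word found in the text."""
    words = text.lower().split()
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return 0.0                      # no lexicon word matched: neutral
    total = sum(scores)
    return total / len(scores) if average else float(total)

review = "The food was good and the staff were nice"
print(total_valence(review))                 # good (+1) + nice (+2) = 3.0
print(total_valence(review, average=True))   # 3 / 2 = 1.5
```

The sketch also shows the weakness listed above: "not good" would still score +1, because the lookup ignores context.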
  
2. Word Cloud:
 - Size: corresponds to the frequency of the word

          from wordcloud import WordCloud
          import matplotlib.pyplot as plt

          # To see all arguments (IPython / Jupyter only):
          # ?WordCloud

          # stopwords: words to leave out of the cloud
          my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(text)

          plt.imshow(my_cloud, interpolation='bilinear')

          plt.axis('off')
          plt.show()

3. Bag-of-Words:

       import pandas as pd
       from sklearn.feature_extraction.text import CountVectorizer

       vect = CountVectorizer(max_features=1000)
       vect.fit(data.review)
       X = vect.transform(data.review)

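       # creates a sparse matrix, to view need to convert to a dense array

       my_array = X.toarray()

       # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
       X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())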

- N grams: Context matters, Unigrams, Bigrams, Trigrams, N-grams
  - Longer n-grams enlarge the vocabulary and raise the risk of overfitting, so use grid search to find the best range
  - use max_features to limit the size of the vocab

        vect = CountVectorizer(ngram_range=(min_n, max_n))

        # vocab size (the values below are only examples):
        CountVectorizer(max_features=1000,   # keep only the most frequent terms
         max_df=0.9,                         # ignore terms with a document frequency higher than this
         min_df=2)                           # ignore rarer terms; integer = count, float = proportion
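
A plain-Python sketch of what an n-gram is, independent of any vectorizer. The helper name and the example sentence are made up for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am not happy".split()

unigrams = ngrams(tokens, 1)   # single words: 'happy' looks positive on its own
bigrams = ngrams(tokens, 2)    # word pairs: ('not', 'happy') keeps the negation
trigrams = ngrams(tokens, 3)

print(bigrams)
```

This is why context matters: the unigram 'happy' alone points the wrong way, while the bigram ('not', 'happy') carries the actual sentiment.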

- Building new features from Text: Enriching the dataset with a sentiment
  - Tokenizing a string

        from nltk import word_tokenize
        anna_k = text
        word_tokenize(anna_k)

        # tokens from a column

        word_tokens = [word_tokenize(review) for review in reviews.review]

        len_tokens = []

        # one entry per review: its number of tokens
        for i in range(len(word_tokens)):
          len_tokens.append(len(word_tokens[i]))

        reviews['n_tokens'] = len_tokens

        # punctuation signs can tell how emotionally charged a review is

- Guessing the language:

      from langdetect import detect_langs
      foreign = 'dfdfb'

      detect_langs(foreign)

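      # returns a list of candidate languages with probabilities, e.g. [en:0.99]

      languages = []
      for row in range(len(reviews)):
        languages.append(detect_langs(reviews.iloc[row,1]))

      # keep only the code of the most likely language, e.g. '[en:0.99]' -> 'en'
      languages = [str(lang).split(':')[0][1:] for lang in languages]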

- Stop Words:
  - occur too frequently and are not informative
  - from wordclouds

        from wordcloud import WordCloud, STOPWORDS
        import matplotlib.pyplot as plt

        # define the stopwords list
        my_stopwords = set(STOPWORDS)
        my_stopwords.update(['movie','movies'])

        my_cloud = WordCloud(background_color='white',stopwords=my_stopwords).generate(name_string)

  - from BOW:
        from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

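        # defining the set of stopwords (custom words added to the built-in list)
        # stop_words expects a list, so convert the resulting frozenset
        my_stop_words = list(ENGLISH_STOP_WORDS.union(['movie', 'movies']))

        vect = CountVectorizer(stop_words=my_stop_words)
        vect.fit(movies.review)
        vect.transform(movies.review)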

  - Capturing a Token Pattern
    - my_string.isalpha()
    - .isdigit()
    - .isalnum()
    - regex
          cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]

          vect = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(tweets.text)
  
  - stemming and lemmatization:
    - stemming: words to root forms even if the stem is not valid in the root language
          # stemming:

          from nltk.stem import PorterStemmer
          porter = PorterStemmer()

          porter.stem('wonderful')

          # other languages

          from nltk.stem.snowball import SnowballStemmer
          dutchstemmer = SnowballStemmer('dutch')

          dutchstemmer.stem(word)

          # stem() works on one word at a time, so tokenize the text first
          from nltk import word_tokenize

          tokens = word_tokenize(text)
          stemmed_tokens = [porter.stem(token) for token in tokens]

    - Lemma: valid roots, requires a pos
          # lemmatizer

          from nltk.stem import WordNetLemmatizer

          WNlemmatizer = WordNetLemmatizer()
          WNlemmatizer.lemmatize(word, pos='a')

  - TfIdf:
    - automatically penalizes stopwords
    - also produces a sparse matrix initially (only non zero values are stored)

          from sklearn.feature_extraction.text import TfidfVectorizer

          # arguments = max_features, ngram_range, stop_words, token_pattern, max_df, min_df

          vect = TfidfVectorizer(max_features=100).fit(tweets.text)
          X = vect.transform(tweets.text)

          # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
          x_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())


4. Final sentiment prediction using machine learning:

  - Classification Problem
    - Logistic Regression:
      - Sigmoid function (0,1)
      - Probability (sentiment = positive|review)
             from sklearn.linear_model import LogisticRegression

             log_reg = LogisticRegression().fit(X,y)

             # Model performance : accuracy

             score = log_reg.score(X,y)

             # score gives different metrics for different models

             from sklearn.metrics import accuracy_score
             y_predicted = log_reg.predict(X)
             accuracy = accuracy_score(y,y_predicted)

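      - Train_test split, Confusion Matrix
             from sklearn.model_selection import train_test_split
             from sklearn.metrics import confusion_matrix

             # random_state must be an integer seed
             X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)

             # stratify=y, proportion same as the given column in split

             # fit on the training part, predict on the held-out part
             log_reg = LogisticRegression().fit(X_train, y_train)
             y_predicted = log_reg.predict(X_test)

             # confusion matrix as returned by sklearn (rows = actual, columns = predicted):
             # [True -, False +]
             # [False -, True +]

             # dividing by len(y_test) gives proportions instead of counts
             print(confusion_matrix(y_test, y_predicted)/len(y_test))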
             
      - Complex models that capture the noise in the dataset lead to overfitting

      - Regularization: way to penalise the models
        - applied by default
        - uses L2 penalty : shrinks all the coeffs towards 0, higher C -> less regularization
               LogisticRegression(penalty='l2',C=1.0)

      - Predicting the probability rather than the class:
            
             y_probab = log_reg.predict_proba(X_test)
             
             # produces an array with one column per class, holding the probability of each class (e.g. classes 0, 1, 2)

             # default threshold is 0.5; however, with imbalanced classes the threshold should depend on the proportion of classes in the data
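
A short sketch of applying a custom probability threshold instead of the default 0.5. The probability array stands in for the output of `predict_proba` on a binary problem, and the threshold 0.3 is an arbitrary example.

```python
import numpy as np

# Hypothetical predict_proba output: columns are P(class 0), P(class 1)
y_probab = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.35, 0.65],
    [0.20, 0.80],
])

def classify(probabilities, threshold=0.5):
    """Predict class 1 when P(class 1) reaches the threshold, else class 0."""
    return (probabilities[:, 1] >= threshold).astype(int)

default_pred = classify(y_probab)          # threshold 0.5
lenient_pred = classify(y_probab, 0.3)     # lower threshold: more positives

print(default_pred, lenient_pred)
```

Lowering the threshold flips the second row (P = 0.40) to the positive class, which helps when the positive class is rare and missing it is costly.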







      



      




  










# Datacamp 4: Spoken language processing

## Introduction to audio data in python:
- processing audio files:
      import wave
      import numpy as np

      good_morning = wave.open('good-morning.wav','r')

      framerate_gm = good_morning.getframerate()

      # audio file duration in seconds: number of frames / frames per second
      duration = good_morning.getnframes()/framerate_gm

      # read all frames as bytes:
      soundwave_gm = good_morning.readframes(-1)

      # from bytes to integers (assumes 16-bit mono samples)
      signal_gm = np.frombuffer(soundwave_gm, dtype='int16')

      # Finding soundwave timestamps : np.linspace gives equally spaced values between start and stop

      time_gm = np.linspace(start=0, stop=len(signal_gm)/framerate_gm, num=len(signal_gm))
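
A self-contained sketch of the same steps that needs no audio file on disk: it writes a short synthetic tone with the `wave` module and reads it back. The frame rate, tone frequency and file name are arbitrary choices for illustration, and 16-bit mono audio is assumed.

```python
import os
import tempfile
import wave

import numpy as np

# Synthesize half a second of a 440 Hz tone as 16-bit samples
framerate = 8000
seconds = 0.5
t = np.arange(int(framerate * seconds)) / framerate
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

# Write it as a mono .wav file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'tone.wav')
with wave.open(path, 'wb') as wav_out:
    wav_out.setnchannels(1)        # mono
    wav_out.setsampwidth(2)        # 2 bytes = 16 bits per sample
    wav_out.setframerate(framerate)
    wav_out.writeframes(tone.tobytes())

# Read it back: frames -> bytes -> integers
with wave.open(path, 'rb') as wav_in:
    n_frames = wav_in.getnframes()
    rate = wav_in.getframerate()
    signal = np.frombuffer(wav_in.readframes(-1), dtype=np.int16)

duration = n_frames / rate                          # seconds
time_axis = np.linspace(0, duration, num=len(signal))
print(duration, len(signal))
```

For stereo files the samples of both channels are interleaved in the byte stream, so the integer array would need reshaping before plotting.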

- Visualizing soundwaves:
      import matplotlib.pyplot as plt

      plt.title('Good afternoon vs good morning')
      plt.xlabel('Time(seconds)')
      plt.ylabel('Amplitude')

      # time_ga / signal_ga come from applying the same steps to the 'good afternoon' file
      plt.plot(time_ga, signal_ga, label='Good Afternoon')
      plt.plot(time_gm, signal_gm, label='Good Morning', alpha=0.5)

      plt.legend()
      plt.show()

- Speech recognition python libraries: (CMU Sphinx, Kaldi, SpeechRecognition)
      # Installing the library (run in a shell):
      # pip install SpeechRecognition

      # SpeechRecognition
      import speech_recognition as sr
      recognizer = sr.Recognizer()

      # energy threshold (silent=100)
      recognizer.energy_threshold = 300

      # recognizer methods that call an API to convert audio to text:
      # recognize_bing, recognize_google, recognize_google_cloud

      # transcribe using API (audio_file must already be AudioData):
      recognizer.recognize_google(audio_data=audio_file, language='en-US')

- Audio read directly from a .wav file is an AudioFile and needs to be converted to AudioData before recognize_google() can use it

      # convert from AudioFile to AudioData
      clean_support_call = sr.AudioFile('clean_support_call.wav')   # file name assumed
      with clean_support_call as source:
        clean_support_call_audio = recognizer.record(source)

      recognizer.recognize_google(audio_data=clean_support_call_audio)

      # record parameters: duration = seconds to record, offset = seconds to skip at the start

      recognizer.record(source, duration=2.0, offset=5.0)

- Different kinds of audio:
      # show all parameter

      with leopard_roar as source:
        leopard_roar_audio = recognizer.record(source)

      recognizer.recognize_google(leopard_roar_audio, show_all=True)

      # returns the full API response with all candidate transcriptions

      # multiple audio files:

      speakers = [sr.AudioFile('s1.wav'), sr.AudioFile('s2.wav')]

      for i, speaker in enumerate(speakers):
        with speaker as source:
          speaker_audio = recognizer.record(source)

- Noisy audio:
      # using -> adjust_for_ambient_noise(source, duration)

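      with noisy_support_call as source:
        # listen to the first 0.5 s to calibrate the energy threshold
        recognizer.adjust_for_ambient_noise(source, duration=0.5)

        # record while the source is still open
        noisy_support_call_audio = recognizer.record(source)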

- Pydub
      # pip install pydub : works for wav
      # for mp3 and other formats: install ffmpeg via ffmpeg.org

      from pydub import AudioSegment

      wav_file = AudioSegment.from_file(file='.wav', format='wav')

      # creates a pydub AudioSegment object

      # Playing a wav file (pip install simpleaudio):

      from pydub.playback import play

      play(wav_file)

      # Audio parameters:

      wav_file.channels        # 1 = mono, 2 = stereo
      wav_file.frame_rate      # samples per second
      wav_file.sample_width    # number of bytes per sample
      wav_file.max             # maximum amplitude
      len(wav_file)            # length of the audio in milliseconds

      # changing parameters (each call returns a new AudioSegment):

      wav_file.set_sample_width(1)
      wav_file.set_frame_rate(16000)
      wav_file.set_channels(1)

      # increasing the volume of audio segments (by 10 dB):
      louder_wav_file = wav_file + 10

- Normalising the audio:
      from pydub import AudioSegment
      from pydub.effects import normalize
      from pydub.playback import play

      # normalize boosts or reduces audio levels so they are consistent across the entire clip

      loud_quiet = AudioSegment.from_file('loud_quiet.wav')
      normalised = normalize(loud_quiet)

      # Remixing audio segments

      # slicing is measured in ms: drop the first 5 seconds of static
      no_static = static_check[5000:]

      # + between two segments appends one after the other
      wav_3 = wav_2 + wav_1

      # combining scales the parameters to those of the higher quality audio file
      # Splitting audio from stereo to mono

      phone_call_channels = phone_call.split_to_mono()

      phone_call_channels[0], phone_call_channels[1]

      # converting and exporting audio signals

      louder_wav_file.export(out_f='.wav', format='wav')

- Reformatting multiple audio files
      import os

      def make_wav(wrong_folder_path, right_folder_path):
        # Loop through wrongly formatted files:
        for file in os.scandir(wrong_folder_path):

          if file.path.endswith('.mp3') or file.path.endswith('.flac'):
            # same base name with a .wav extension, placed in the target folder
            out_file = os.path.join(right_folder_path, os.path.splitext(os.path.basename(file.path))[0] + '.wav')

            AudioSegment.from_file(file.path).export(out_file, format='wav')

- Spoken Language processing pipeline:
      from pydub import AudioSegment
      import speech_recognition as sr

      def convert_to_wav(filename):
        """Takes an audio file of non .wav format and converts to .wav"""

        # Import audio file
        audio = AudioSegment.from_file(filename)

        # Create new filename
        new_filename = filename.split(".")[0] + ".wav"

        # Export file as .wav
        audio.export(new_filename, format='wav')
        print(f"Converting {filename} to {new_filename}...")

      # Test the function
      convert_to_wav('call_1.mp3')


      def show_pydub_stats(filename):
        """Returns different audio attributes related to an audio file."""

        # Create AudioSegment instance
        audio_segment = AudioSegment.from_file(filename)

        # Print audio attributes and return AudioSegment instance
        print(f"Channels: {audio_segment.channels}")
        print(f"Sample width: {audio_segment.sample_width}")
        print(f"Frame rate (sample rate): {audio_segment.frame_rate}")
        print(f"Frame width: {audio_segment.frame_width}")
        print(f"Length (ms): {len(audio_segment)}")
        return audio_segment

      # Try the function
      call_1_audio_segment = show_pydub_stats('call_1.wav')


      def transcribe_audio(filename):
        """Takes a .wav format audio file and transcribes it to text."""

        # Setup a recognizer instance
        recognizer = sr.Recognizer()

        # Import the audio file and convert to audio data
        audio_file = sr.AudioFile(filename)
        with audio_file as source:
          audio_data = recognizer.record(source)

        # Return the transcribed text
        return recognizer.recognize_google(audio_data)

      # Test the function
      print(transcribe_audio('call_1.wav'))

- Sentiment analysis on spoken language
      # pip install nltk (run in a shell)

      import nltk
      nltk.download('punkt')
      nltk.download('vader_lexicon')

      # import sentiment analysis class

      from nltk.sentiment.vader import SentimentIntensityAnalyzer

      # create sentiment analysis instance
      sid = SentimentIntensityAnalyzer()

      print(sid.polarity_scores('text'))

      from nltk.tokenize import sent_tokenize

      for sentence in sent_tokenize(text):
        print(sentence)
        print(sid.polarity_scores(sentence))

- Named Entity recognition on transcribed text

      import spacy

      nlp = spacy.load('en_core_web_sm')
      doc = nlp('text')

      for sentence in doc.sents:
        print(sentence)

      for entity in doc.ents:
        print(entity.text, entity.label_)

      # custom named entities:

      from spacy.pipeline import EntityRuler
      print(nlp.pipeline)

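- Classifying transcribed speech using sklearn:

      import os
      from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import Pipeline

      post_purchase_audio = os.listdir('post_purchase')

      # Build the text_classifier as an sklearn pipeline
      text_classifier = Pipeline([
          ('vectorizer', CountVectorizer()),
          ('tfidf', TfidfTransformer()),      # must be an instance, not the class
          ('classifier', MultinomialNB()),
      ])

      # Fit the classifier pipeline on the training data
      text_classifier.fit(train_df.text, train_df.label)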
