In [54]:
import numpy as np
import pandas as pd

In [55]:
doc = '''In the era of digital transformation, data science has emerged as a transformative force, reshaping industries and decision-making processes across the globe. At its core, data science is the amalgamation of statistical methodologies, computational techniques, and domain expertise aimed at extracting meaningful insights from vast and complex datasets. As organizations increasingly recognize the value of data, the demand for skilled data scientists has surged. Data science encompasses a broad spectrum, from data collection and cleaning to advanced machine learning algorithms and predictive modeling. This article delves into the foundational concepts of data science, the key steps in its lifecycle, the tools and technologies driving its advancements, and its myriad applications across diverse sectors.
The data science lifecycle comprises a series of interconnected stages, each crucial for the effective extraction of knowledge from data. The journey begins with data collection and preprocessing, where raw data is gathered and refined to ensure accuracy and relevance. Exploratory Data Analysis (EDA) follows, a stage characterized by a deep dive into the dataset to identify patterns, trends, and outliers. The subsequent steps involve modeling and algorithm selection, where machine learning techniques are applied to build predictive models. Ensuring the reliability of these models is achieved through rigorous evaluation and validation processes. Feature engineering and selection further optimize model performance. The final stages involve the deployment of models into real-world scenarios and ongoing maintenance to adapt to changing data patterns and external factors.
A robust data science ecosystem is powered by a suite of programming languages, libraries, and frameworks. Python and R stand out as the primary languages, celebrated for their versatility and extensive libraries. Libraries such as NumPy and Pandas facilitate data manipulation, while Scikit-Learn and TensorFlow offer a comprehensive set of tools for machine learning tasks. Data visualization, a critical aspect of data science, is made accessible through tools like Matplotlib and Tableau. In the realm of big data, technologies like Hadoop and Spark provide scalable solutions for handling and processing massive datasets. This dynamic toolkit empowers data scientists to navigate the intricacies of data with efficiency and precision.
The impact of data science reverberates across industries, introducing innovative solutions and catalyzing advancements. In healthcare, data science plays a pivotal role in disease diagnosis, treatment personalization, and public health management. The finance sector leverages data science for fraud detection, risk assessment, and algorithmic trading, enhancing operational efficiency and security. Marketing and e-commerce thrive on data-driven insights, enabling personalized customer experiences and targeted campaigns. Social media analysis, including sentiment analysis, informs brand strategies and public perception. In manufacturing, predictive maintenance powered by data science reduces downtime and optimizes machinery performance. These diverse applications underscore the versatility and transformative potential of data science.
As data science continues to evolve, its trajectory points towards an even more data-centric future. The amalgamation of artificial intelligence and machine learning with data science is poised to unlock unprecedented possibilities. Ethical considerations, privacy concerns, and responsible data usage are becoming central themes in the data science narrative. As industries increasingly integrate data-driven approaches into their operations, the demand for skilled professionals will persist. In conclusion, the journey through the data science landscape—from its foundational concepts to applications and future prospects—reveals a field that not only interprets the language of data but also shapes the future of industries and societies at large. The data-driven era has just begun, and the possibilities it presents are limited only by our imagination and ethical considerations.
'''


In [56]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [57]:
tokenizer = Tokenizer()

In [58]:
tokenizer.fit_on_texts([doc])

In [59]:
len(tokenizer.word_index)

309

In [60]:
tokenizer.word_index

{'and': 1,
 'data': 2,
 'the': 3,
 'of': 4,
 'science': 5,
 'a': 6,
 'to': 7,
 'in': 8,
 'for': 9,
 'as': 10,
 'its': 11,
 'is': 12,
 'industries': 13,
 'machine': 14,
 'learning': 15,
 'into': 16,
 'by': 17,
 'has': 18,
 'across': 19,
 'at': 20,
 'from': 21,
 'predictive': 22,
 'tools': 23,
 'applications': 24,
 'with': 25,
 'analysis': 26,
 'are': 27,
 'models': 28,
 'through': 29,
 'libraries': 30,
 'driven': 31,
 'future': 32,
 'era': 33,
 'transformative': 34,
 'processes': 35,
 'amalgamation': 36,
 'techniques': 37,
 'insights': 38,
 'datasets': 39,
 'increasingly': 40,
 'demand': 41,
 'skilled': 42,
 'scientists': 43,
 'collection': 44,
 'modeling': 45,
 'this': 46,
 'foundational': 47,
 'concepts': 48,
 'steps': 49,
 'lifecycle': 50,
 'technologies': 51,
 'advancements': 52,
 'diverse': 53,
 'stages': 54,
 'journey': 55,
 'where': 56,
 'patterns': 57,
 'involve': 58,
 'selection': 59,
 'these': 60,
 'performance': 61,
 'maintenance': 62,
 'powered': 63,
 'languages': 64,
 'thei

In [61]:
tokenizer.word_counts

OrderedDict([('in', 8),
             ('the', 31),
             ('era', 2),
             ('of', 18),
             ('digital', 1),
             ('transformation', 1),
             ('data', 35),
             ('science', 16),
             ('has', 3),
             ('emerged', 1),
             ('as', 6),
             ('a', 11),
             ('transformative', 2),
             ('force', 1),
             ('reshaping', 1),
             ('industries', 4),
             ('and', 40),
             ('decision', 1),
             ('making', 1),
             ('processes', 2),
             ('across', 3),
             ('globe', 1),
             ('at', 3),
             ('its', 6),
             ('core', 1),
             ('is', 6),
             ('amalgamation', 2),
             ('statistical', 1),
             ('methodologies', 1),
             ('computational', 1),
             ('techniques', 2),
             ('domain', 1),
             ('expertise', 1),
             ('aimed', 1),
             ('extracting'

In [62]:
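input_sequences = []
# Treat each line (paragraph) of the document as one training sentence.
for sentence in doc.split('\n'):
  tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]

  # Emit every prefix of length >= 2; the last token of each prefix is the prediction target.
  for i in range(1,len(tokenized_sentence)):
    input_sequences.append(tokenized_sentence[:i+1])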
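
The prefix expansion in the cell above can be checked on a tiny input. This is a minimal sketch in plain Python; `toy_tokens` and `prefix_sequences` are illustrative names that do not appear in the notebook, and the token ids are borrowed from the first few ids printed in the next cell.

```python
# Toy token ids standing in for one tokenized line of the document.
toy_tokens = [8, 3, 33, 4]

def prefix_sequences(tokens):
    """Return every prefix of length >= 2, mirroring the notebook loop."""
    return [tokens[: i + 1] for i in range(1, len(tokens))]

prefixes = prefix_sequences(toy_tokens)
print(prefixes)  # [[8, 3], [8, 3, 33], [8, 3, 33, 4]]
```

A line with n tokens yields n - 1 training sequences, so empty lines and single-word lines contribute nothing.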

In [63]:
input_sequences

[[8, 3],
 [8, 3, 33],
 [8, 3, 33, 4],
 [8, 3, 33, 4, 75],
 [8, 3, 33, 4, 75, 76],
 [8, 3, 33, 4, 75, 76, 2],
 [8, 3, 33, 4, 75, 76, 2, 5],
 [8, 3, 33, 4, 75, 76, 2, 5, 18],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34, 78],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34, 78, 79],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34, 78, 79, 13],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34, 78, 79, 13, 1],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34, 78, 79, 13, 1, 80],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34, 78, 79, 13, 1, 80, 81],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34, 78, 79, 13, 1, 80, 81, 35],
 [8, 3, 33, 4, 75, 76, 2, 5, 18, 77, 10, 6, 34, 78, 79, 13, 1, 80, 81, 35, 19],
 [8,
  3,
  33,
  4,
  75,
  76,
  2,
  5,
  18,
  77,
  10,
  6,
  34,
  78,
  79,
  13,
  1,
  80,
  81,
  35,
  19,
  3

In [64]:
max_len = max(len(x) for x in input_sequences)

In [65]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_input_sequences = pad_sequences(input_sequences, maxlen = max_len, padding='pre')

In [66]:
padded_input_sequences

array([[  0,   0,   0, ...,   0,   8,   3],
       [  0,   0,   0, ...,   8,   3,  33],
       [  0,   0,   0, ...,   3,  33,   4],
       ...,
       [  0,   0,  10, ..., 308, 309,   1],
       [  0,  10,   2, ..., 309,   1,  72],
       [ 10,   2,   5, ...,   1,  72,  73]])

In [67]:
X = padded_input_sequences[:,:-1]  # inputs: every token except the last

In [68]:
y = padded_input_sequences[:,-1]  # target: the last token of each sequence

In [69]:
X.shape

(573, 125)

In [70]:
y.shape

(573,)
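
To see what pre-padding and the `X`/`y` split do to individual rows, here is a small NumPy-only sketch. The helper `pre_pad` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`, written only for illustration; it is not the Keras implementation.

```python
import numpy as np

def pre_pad(sequences, maxlen):
    """Left-pad each sequence with zeros to maxlen (illustrative stand-in)."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]  # keep the most recent tokens if too long
        if seq:
            out[row, -len(seq):] = seq
    return out

toy = [[8, 3], [8, 3, 33], [8, 3, 33, 4]]
padded = pre_pad(toy, maxlen=4)

# Same slicing as the notebook: the last column is the target.
X_toy, y_toy = padded[:, :-1], padded[:, -1]
print(X_toy)
print(y_toy)
```

Padding on the left keeps the target word in the final column of every row, which is why a single slice separates inputs from labels.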

In [71]:
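from tensorflow.keras.utils import to_categorical
# One class per word index, plus index 0 reserved for padding (309 + 1 = 310).
y = to_categorical(y, num_classes=len(tokenizer.word_index) + 1)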


In [72]:
y.shape

(573, 310)
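
The one-hot step can be reproduced on toy labels with NumPy alone. This is a sketch of the idea, not of how `to_categorical` is implemented; `y_toy` and `num_classes` are made-up values.

```python
import numpy as np

y_toy = np.array([3, 1, 4])  # hypothetical integer word indices
num_classes = 5              # indices 0..4, with 0 reserved for padding

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[y_toy]
print(one_hot.shape)  # (3, 5)
```

With a larger vocabulary, one alternative is to keep `y` as integers and compile with `loss='sparse_categorical_crossentropy'`, which avoids materializing the one-hot matrix.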

In [73]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [74]:
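vocab_size = len(tokenizer.word_index) + 1  # 309 words + padding index 0 = 310
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=X.shape[1]))  # X.shape[1] == 125
model.add(LSTM(150))
model.add(Dense(vocab_size, activation='softmax'))  # one probability per word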

In [75]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [76]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 125, 100)          31000     
                                                                 
 lstm_2 (LSTM)               (None, 150)               150600    
                                                                 
 dense_2 (Dense)             (None, 310)               46810     
                                                                 
Total params: 228410 (892.23 KB)
Trainable params: 228410 (892.23 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
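
The parameter counts in the summary above can be re-derived by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector; the variable names are illustrative.

```python
vocab_size, embed_dim, lstm_units = 310, 100, 150

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output class.
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 31000 150600 46810 228410
```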


In [77]:

# Model training
model.fit(X, y, epochs=100)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x25cb912d6d0>

In [78]:
import time
text = "In the era of digital"

for i in range(5):
  # tokenize
  token_text = tokenizer.texts_to_sequences([text])[0]
  # padding (same length as the training inputs, X.shape[1] == 125)
  padded_token_text = pad_sequences([token_text], maxlen=X.shape[1], padding='pre')
  # predict: index of the most probable next word
  pos = int(np.argmax(model.predict(padded_token_text)))

  # look the word up directly; index 0 is padding and has no word
  word = tokenizer.index_word.get(pos)
  if word is not None:
    text = text + " " + word
    print(text)
    time.sleep(2)

In the era of digital transformation
In the era of digital transformation data
In the era of digital transformation data science
In the era of digital transformation data science has
In the era of digital transformation data science has emerged
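
The generation loop above is greedy decoding: at each step it appends the single most probable word. The same control flow can be sketched without TensorFlow. Everything here is hypothetical: `word_index` is a three-word toy vocabulary and `fake_predict` is a stand-in for `model.predict`, not the trained model.

```python
import numpy as np

# Hypothetical three-word vocabulary; index 0 is reserved for padding.
word_index = {"data": 1, "science": 2, "has": 3}
index_word = {i: w for w, i in word_index.items()}

def fake_predict(token_ids):
    """Stand-in for model.predict: always favours (last id % 3) + 1."""
    probs = np.full(len(word_index) + 1, 0.01)
    probs[(token_ids[-1] % 3) + 1] = 0.97
    return probs

def greedy_generate(seed, n_words, predict_fn):
    """Append the most probable next word n_words times."""
    words = seed.lower().split()
    for _ in range(n_words):
        token_ids = [word_index[w] for w in words if w in word_index]
        next_id = int(np.argmax(predict_fn(token_ids)))
        if next_id == 0:  # padding index: nothing sensible to append
            break
        words.append(index_word[next_id])
    return " ".join(words)

result = greedy_generate("data", 3, fake_predict)
print(result)  # data science has data
```

Because the notebook's model was trained for 100 epochs on a single short document, reproducing the training text word for word, as in the output above, most likely reflects memorization rather than generalization.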
